Aims

The University of Arizona Artificial Intelligence Lab (AI Lab) Dark Web project 
is a long-term scientific research program that aims to study and understand the 
international terrorism (jihadist) phenomena via a computational, data-centric 
approach. We aim to collect “ALL” web content generated by international terrorist 
groups, including web sites, forums, chat rooms, blogs, social networking sites, 
videos, virtual world, etc. We have developed various multilingual data mining, text 
mining, and web mining techniques to perform link analysis, content analysis, web 
metrics (technical sophistication) analysis, sentiment analysis, authorship analysis, 
and video analysis in our research. The approaches and methods developed in this 
project contribute to advancing the field of Intelligence and Security Informatics 
(ISI). Such advances will help related stakeholders perform terrorism research and 
facilitate international security and peace. 

Dark Web research has been featured in many national, international and local 
press and media, including: National Science Foundation press, Associated Press, 
BBC, Fox News, National Public Radio, Science News, Discover Magazine, 
Information Outlook, Wired Magazine, The Bulletin (Australian), Australian 
Broadcasting Corporation, Arizona Daily Star, East Valley Tribune, Phoenix ABC 
Channel 15, and Tucson Channels 4, 6, and 9. As an NSF-funded research project, 
our research team has generated significant findings and publications in major com- 
puter science and information systems journals and conferences. We hope our 
research will help educate the next generation of cyber/Internet-savvy analysts and 
agents in the intelligence, justice, and defense communities. 

This monograph aims to provide an overview of the Dark Web landscape, sug- 
gest a systematic, computational approach to understanding the problems, and illus- 
trate research progress with selected techniques, methods, and case studies developed 
by the University of Arizona AI Lab Dark Web team members. 
Audience

This book aims to provide an interdisciplinary and understandable monograph about 
Dark Web research. We hope to bring useful knowledge to scientists, security pro- 
fessionals, counter-terrorism experts, and policy makers. The proposed work could 
also serve as reference material or a textbook in graduate-level courses related to
information security, information policy, information assurance, information sys- 
tems, terrorism, and public policy. 

The primary audience for the proposed monograph will include the following: 

• IT Academic Audience: College professors, research scientists, graduate students, 
and select undergraduate juniors and seniors in computer science, information 
systems, information science, and other related IT disciplines who are interested 
in intelligence analysis and data mining and their security applications. 

• Security Academic Audience: College professors, research scientists, graduate 
students, and select undergraduate juniors and seniors in political sciences, ter- 
rorism study, and criminology who are interested in exploring the impact of the 
Dark Web on society. 

• Security Industry Audience: Executives, managers, analysts, and researchers in 
security and defense industry, think tanks, and research centers that are actively 
conducting IT-related security research and development, especially using open 
source web contents. 

• Government Audience: Policy makers, managers, and analysts in federal, state, 
and local governments who are interested in understanding and assessing the 
impact of the Dark Web and the security concerns it raises.



Scope and Organization 

The book consists of three parts. In Part I, we provide an overview of the research 
framework and related resources relevant to intelligence and security informatics 
(ISI) and terrorism informatics. Part II presents ten chapters on computational 
approaches and techniques developed and validated in the Dark Web research. Part 
III presents nine chapters of case studies based on the Dark Web research approach. 
We provide a brief summary of each chapter below. 

Part I. Research Framework: Overview and Introduction 

• Chapter 1. Dark Web Research Overview 

The AI Lab Dark Web project is a long-term scientific research program that 
aims to study and understand the international terrorism (jihadist) phenomena 
via a computational, data-centric approach. We aim to collect “ALL” web con- 
tent generated by international terrorist groups, including web sites, forums, chat 
rooms, blogs, social networking sites, videos, virtual world, etc. We have devel- 
oped various multilingual data mining, text mining, and web mining techniques 
to perform link analysis, content analysis, web metrics (technical sophistication)
analysis, sentiment analysis, authorship analysis, and video analysis in our
research. 

• Chapter 2. Intelligence and Security Informatics (ISI): Research Framework 

In this chapter we review the computational research framework that is adopted 
by the Dark Web research. We first present the security research context, fol- 
lowed by description of a data mining framework for intelligence and security 
informatics research. To address the data and technical challenges facing ISI, we 
present a research framework with a primary focus on KDD (Knowledge 
Discovery from Databases) technologies. The framework is discussed in the con- 
text of crime types and security implications. 

• Chapter 3. Terrorism Informatics 

In this chapter we provide an overview of selected resources of relevance to 
“Terrorism Informatics,” a new discipline that aims to study the terrorism phe- 
nomena with a data-driven, quantitative, and computational approach. We first 
summarize several critical books that lay the foundation for studying terrorism in 
the new Internet era. We then review important terrorism research centers and 
resources that are of relevance to our Dark Web research. 

Part II. Dark Web Research: Computational Approach and Techniques 

• Chapter 4. Forum Spidering 

In this study we propose a novel crawling system designed to collect Dark Web 
forum content. The system uses a human-assisted accessibility approach to gain 
access to Dark Web forums. Several URL ordering features and techniques 
enable efficient extraction of forum postings. The system also includes an incre- 
mental crawler coupled with a recall improvement mechanism intended to facili- 
tate enhanced retrieval and updating of collected content. 

• Chapter 5. Link and Content Analysis 

To improve understanding of terrorist activities, we have developed a novel 
methodology for collecting and analyzing Dark Web information. The methodol- 
ogy incorporates information collection, analysis, and visualization techniques, 
and exploits various web information sources. We applied it to collecting and 
analyzing information of selected jihad web sites and developed visualization of 
their site contents, relationships, and activity levels. 

• Chapter 6. Dark Network Analysis 

Dark networks such as terrorist networks and narcotics-trafficking networks are
hidden from our view yet could have a devastating impact on our society and 
economy. Based on analysis of four real-world “dark” networks, we found that 
these covert networks share many common topological properties with other 
types of networks. Their efficiency in communication and flow of information, 
commands, and goods can be tied to their small-world structures characterized 
by small average path length and high clustering coefficient. In addition, we 
found that because of the small-world properties dark networks are more vulner- 
able to attacks on the bridges that connect different communities than to attacks 
on the hubs. 
• Chapter 7. Interactional Coherence Analysis

Despite the rapid growth of text-based computer-mediated communication 
(CMC), its limitations have rendered the media highly incoherent. Interactional 
coherence analysis (ICA) attempts to accurately identify and construct interac- 
tion networks of CMC messages. In this study, we propose the Hybrid Interactional 
Coherence (HIC) algorithm for identification of web forum interaction. HIC uti- 
lizes both system features, such as header information and quotations, and lin- 
guistic features, such as direct address and lexical relation. Furthermore, several 
similarity-based methods, including a Lexical Match Algorithm (LMA) and a 
sliding window method, are utilized to account for interactional idiosyncrasies. 

• Chapter 8. Dark Web Attribute System 

In this study we propose a Dark Web Attribute System (DWAS) to enable quan- 
titative Dark Web content analysis from three perspectives: technical sophistica- 
tion, content richness, and web interactivity. Using the proposed methodology, 
we identified and examined the Internet usage of major Middle Eastern terrorist/ 
extremist groups. In our comparison of terrorist/extremist web sites to U.S. gov- 
ernment web sites, we found that terrorists/extremist groups exhibited levels of 
web knowledge similar to that of U.S. government agencies. Moreover, terror- 
ists/extremists had a strong emphasis on multimedia usage and their web sites 
employed significantly more sophisticated multimedia technologies than gov- 
ernment web sites. 

• Chapter 9. Authorship Analysis 

In this study we addressed the online anonymity problem by successfully apply- 
ing authorship analysis to English and Arabic extremist group web forum mes- 
sages. The performance impact of different feature categories and techniques 
was evaluated across both languages. In order to facilitate enhanced writing style 
identification, a comprehensive list of online authorship features was incorpo- 
rated. Additionally, an Arabic language model was created by adopting specific 
features and techniques to deal with the challenging linguistic characteristics of 
Arabic, including an elongation filter and a root clustering algorithm. 

• Chapter 10. Sentiment Analysis 

In this study the use of sentiment analysis methodologies is proposed for classi- 
fication of web forum opinions in multiple languages. The utility of stylistic and 
syntactic features is evaluated for sentiment classification of English and Arabic 
content. Specific feature extraction components are integrated to account for the 
linguistic characteristics of Arabic. The Entropy Weighted Genetic Algorithm 
(EWGA) is also developed, which is a hybridized genetic algorithm that incor- 
porates the information gain heuristic for feature selection. The proposed fea- 
tures and techniques are evaluated on U.S. and Middle Eastern extremist web 
forum postings. 

• Chapter 11. Affect Analysis 

Analysis of affective intensities in computer-mediated communication is important 
in order to allow a better understanding of online users’ emotions and preferences. 
In this study we compared several feature representations for affect analysis, 
including learned n-grams and various automatically- and manually-crafted
affect lexicons. We also proposed the support vector regression correlation 
ensemble (SVRCE) method for enhanced classification of affect intensities. 
Experiments were conducted on U.S. domestic and Middle Eastern extremist 
web forums. 

• Chapter 12. CyberGate Visualization 

Computer-mediated communication (CMC) analysis systems are important for 
improving participant accountability and researcher analysis capabilities. 
However, existing CMC systems focus on structural features, with little support 
for analysis of text content in web discourse. In this study we propose a frame- 
work for CMC text analysis grounded in Systemic Functional Linguistic Theory. 
Our framework addresses several ambiguous CMC text mining issues, including 
the relevant tasks, features, information types, feature selection methods, and 
visualization techniques. Based on it, we have developed a system called 
CyberGate, which includes the Writeprint and Ink Blot techniques. These tech- 
niques incorporate complementary feature selection and visualization methods 
in order to allow a breadth of analysis and categorization capabilities. 

• Chapter 13. Dark Web Forum Portal 

The Dark Web Forum Portal provides web-enabled access to critical interna- 
tional jihadist web forums. The focus of this chapter is on the significant exten- 
sions to previous work including: increasing the scope of our data collection; 
adding an incremental spidering component for regular data updates; enhancing 
the searching and browsing functions; enhancing multilingual machine transla- 
tion for Arabic, French, German and Russian; and advanced Social Network 
Analysis. A case study on identifying active jihadi participants in web forums is 
shown at the end. 

Part III. Dark Web Research: Case Studies 

• Chapter 14. Jihadi Video Analysis 

This chapter presents an exploratory study of jihadi extremist groups’ videos 
using content analysis and a multimedia coding tool to explore the types of video, 
groups’ modus operandi, and production features that lend support to extremist 
groups. The videos convey messages powerful enough to mobilize members, 
sympathizers, and even new recruits to launch attacks that are captured (on video) 
and disseminated globally through the Internet. The videos are important for 
jihadi extremist groups’ learning, training, and recruitment. In addition, the con- 
tent collection and analysis of extremist groups’ videos can help policy makers, 
intelligence analysts, and researchers better understand the extremist groups’ ter- 
ror campaigns and modus operandi, and help suggest counter-intelligence strate- 
gies and tactics for troop training. 

• Chapter 15. Extremist YouTube Videos 

In this study, we propose a text-based framework for video content classification 
of online video-sharing web sites. Different types of user-generated data (e.g., 
titles, descriptions, and comments) were used as proxies for online videos, and 
three types of text features (lexical, syntactic, and content-specific features) were
extracted. Three feature-based classification techniques (C4.5, Naive Bayes, and 
SVM) were used to classify videos. To evaluate the proposed framework, we 
developed a testbed based on jihadi videos collected from the most popular 
video-sharing site, YouTube. 

• Chapter 16. Improvised Explosive Devices (IED) on Dark Web

This chapter presents a cyber-archaeology approach to social movement research. 
Cultural cyber-artifacts of significance to the social movement are collected and 
classified using automated techniques, enabling analysis across multiple related 
virtual communities. Approaches to the analysis of cyber-artifacts are guided by 
perspectives of social movement theory. A Dark Web case study on a broad 
group of related IED virtual communities is presented to demonstrate the effi- 
cacy of the framework and provide a detailed instantiation of the proposed 
approach for evaluation. 

• Chapter 17. Weapons of Mass Destruction (WMD) on Dark Web

In this chapter we propose a research framework that aims to investigate the 
capability, accessibility, and intent of critical high-risk countries, institutions, 
researchers, and extremist or terrorist groups. We propose to develop a knowl- 
edge base of the Nuclear Web that will collect, analyze, and pinpoint significant 
actors in the high-risk international nuclear physics and weapons communities. 
We also identify potential extremist or terrorist groups from our Dark Web test- 
bed who might pose WMD threats to the U.S. and the international community. 
Selected knowledge mapping and focused web crawling techniques and findings 
from a preliminary study are presented. 

• Chapter 18. Bioterrorism Knowledge Mapping

In this research we propose a framework to identify the researchers who have 
expertise in the bioterrorism agents/diseases research domain, the major institu- 
tions and countries where these researchers reside, and the emerging topics and 
trends in bioterrorism agents/diseases research. By utilizing knowledge mapping 
techniques, we analyzed the productivity status, collaboration status, and emerg- 
ing topics in the bioterrorism domain. The analysis results provide insights into 
the research status of bioterrorism agents/diseases and thus allow a more com- 
prehensive view of bioterrorism researchers and ongoing work. 

• Chapter 19. Women’s Forums on the Dark Web

In this study, we develop a feature-based text classification framework to examine 
the online gender differences between female and male posters on web forums by 
analyzing writing styles and topics of interests. We examine the performance of 
different feature sets in an experiment involving political opinions. The results of 
our experimental study on this Islamic women’s political forum show that the 
feature sets containing both content-free and content-specific features perform 
significantly better than those consisting of only content-free features. 






• Chapter 20. US Domestic Extremist Groups 

U.S. domestic extremist groups have increased in number and are intensively 
utilizing the Internet as an effective tool to share resources and members with 
limited regard for geographic, legal, or other obstacles. In this study, we develop 
automated and semi-automated methodologies for capturing, classifying, and 
organizing domestic extremist web site data. We found that by analyzing the 
hyperlink structures and content of domestic extremist web sites and construct- 
ing social network maps, their inter-organizational structure and cluster affinities 
could be identified. 

• Chapter 21. International Falun Gong Movement on the Web 

In this study, we developed a cyber-archaeology approach and used the interna- 
tional Falun Gong (FLG) movement as a case study. The FLG is known as a 
peaceful international social movement, unlike the more violent jihadi move- 
ment. We employed Social Network Analysis and Writeprint to analyze FLG’s 
cyber-artifacts from the perspectives of links, web content, and forum content. In 
the link analysis, FLG’s web sites linked closely to Chinese democracy and 
human rights social movement organizations (SMOs), reflecting FLG’s histori- 
cal conflicts with the Chinese government after the official ban in 1999. 

• Chapter 22. Botnets and Cyber Criminals 

In the last several years, the nature of computer hacking has completely changed. 
Cybercrime has risen to unprecedented sophistication with the evolution of bot- 
net technology, and an underground community of cyber criminals has arisen, 
capable of inflicting serious socioeconomic and infrastructural damage in the 
information age. This chapter serves as an introduction to the world of modern 
cybercrime and discusses information systems to investigate it. We investigated 
the command and control (C&C) signatures of major botnet herders using data 
collected from the ShadowServer Foundation, a nonprofit research group for bot- 
net research. We also performed exploratory population modeling of the bots and 
cluster analysis of selected cyber criminals. 
Part I

Research Framework: 
Overview and Introduction 



Chapter 1 

Dark Web Research Overview 



1 Introduction 

Gabriel Weimann of Haifa University in Israel estimated that there are about 5,000 
terrorist web sites as of 2006 (Weimann 2006). Based on our actual spidering expe- 
rience over the past 8 years, we believe there are about 100,000 sites of extremist 
and terrorist content as of 2010, including web sites, forums, blogs, social networking
sites, video sites, and virtual world sites (e.g., Second Life). The largest increase
since 2006-2007 is in various new Web 2.0 sites (forums, videos, blogs, virtual 
world, etc.) in different languages (i.e., for homegrown groups, particularly in 
Europe). We have found significant terrorism content in more than 15 languages. 

We collect (using computer programs) various web contents every 2-3 months; 
we started spidering in 2002. Currently, we only collect the complete contents of 
about 1,000 sites, in Arabic, Spanish, and English languages. We also have partial 
contents of about 10,000 additional sites. In total, our collection is about 15 TB in
size, with close to 2,000,000,000 pages/files/postings from more than 10,000 sites. 
We believe our Dark Web collection is the largest open-source extremist and terror- 
ist collection in the academic world. Researchers can have graded access to our 
collection by contacting our research center. We present below a summary of impor- 
tant Dark Web contents. 



1.1 Web Sites 

Our web site collection consists of the complete contents of about 1,000 sites, in 
various static (HTML, PDF, Word) and dynamic (PHP, JSP, CGI) formats. We collect
every single page, link, and attachment within these sites. We also collect partial 
information from about 10,000 related (linked) sites. Some large well-known sites 
contain more than 10,000 pages/files in 10+ languages (in selected pages).
1.2 Forums

We collect the complete contents (authors, headings, postings, threads, time tags, 
etc.) of about 300 terrorist forums. We also perform periodic updates. Some large 
radical sites include more than 30,000 members with close to 1,000,000 messages 
posted. We have also developed the Dark Web Forum Portal, which provides beta 
search access to several international jihadist “Dark Web” forums collected by the 
Artificial Intelligence Lab at the University of Arizona. Users may search, view, 
translate, and download messages (by forum member name, thread title, topic, 
keyword, etc.). 

Preliminary social network analysis visualization is also available. 



1.3 Blogs, Social Networking Sites, and Virtual Worlds 

We have identified and extracted many smaller, transient (meaning the sites appear 
and disappear very quickly) blogs and social networking sites, mostly hosted by 
terrorist sympathizers and “wannabes.” We have also identified more than 30 (self- 
proclaimed) terrorist or extremist groups in virtual world sites. (However, we are 
still unsure whether they are “real” terrorist/extremists or just playing the roles in 
virtual games). 



1.4 Videos and Multimedia Content 

Terrorist sites are extremely rich in content, with heavy usage of multimedia for- 
mats. We have identified and extracted about 1,000,000 images and 100,000 videos 
from many terrorist sites and specialty multimedia file-hosting third-party servers. 
More than 50% of our videos are IED (Improvised Explosive Devices) related. 



2 Computational Techniques (Data Mining, Text Mining, and Web Mining)

Our computational tools are grouped into two categories: (1) collection and (2) 
analysis and visualization. Significant deep web spidering, computational linguistic 
analysis, sentiment analysis, social network analysis, and social media analysis and 
visualization research has been conducted by members of the AI Lab over the past 
8 years. We summarize selected approaches below. More details about specific tech- 
niques and case studies will be presented in subsequent chapters. 
2.1 Dark Web Collection

2.1.1 Web Site Spidering 

Building on our previous digital library research, we have developed various focused
spiders/crawlers for collecting deep web content. Our spiders can access
password-protected sites and perform randomized (human-like) fetching. Our spiders are
trained to fetch all HTML, PDF, and Word files; links; PHP, CGI, and ASP files; images;
audio; and videos in a web site. To ensure freshness, we spider selected web sites
every 2-3 months. 



2.1.2 Forum Spidering 

Our forum spidering tool recognizes 15+ forum-hosting software and their formats. 
We collect the complete forum, including authors, headings, postings, threads, time 
tags, etc., which allows us to reconstruct participant interactions. We perform peri- 
odic forum spidering and incremental updates based on research needs. We have 
collected and processed forum contents in Arabic, English, Spanish, French, and 
Chinese using selected computational linguistics techniques. 
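
The incremental update step described above can be sketched as follows. This is a minimal illustration, not the AI Lab's spidering tool: the Post record, its field names, and the assumption that every posting carries a stable unique identifier are all hypothetical. Postings that are already archived are skipped, and the archive is then regrouped by thread so that participant interactions can be reconstructed in time order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Post:
    post_id: str    # hypothetical: a stable unique ID assigned by the forum software
    thread_id: str
    author: str
    timestamp: str  # ISO 8601 text, so string order matches time order
    body: str


def incremental_update(archive, fetched):
    """Add newly fetched posts to the archive and return only the new ones."""
    new_posts = []
    for post in fetched:
        if post.post_id not in archive:
            archive[post.post_id] = post
            new_posts.append(post)
    return new_posts


def threads(archive):
    """Group archived posts by thread, oldest first, to rebuild each conversation."""
    grouped = {}
    for post in archive.values():
        grouped.setdefault(post.thread_id, []).append(post)
    for posts in grouped.values():
        posts.sort(key=lambda p: p.timestamp)
    return grouped
```

Keying the archive on the posting identifier keeps a repeated crawl cheap: a second pass over the same forum stores only what has appeared since the previous pass.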



2.1.3 Multimedia (Image, Audio, and Video) Spidering 

We have developed specialized techniques for spidering and collecting multimedia 
files and attachments from web sites and forums. We plan to perform steganography
research to identify encrypted images in our collection and multimedia analysis 
(video segmentation, image recognition, voice/speech recognition) to identify 
unique terrorist-generated video contents and styles. 



2.2 Dark Web Analysis and Visualization 

2.2.1 Social Network Analysis (SNA) 

We have developed various SNA techniques to examine web site and forum posting 
relationships. We have used various topological metrics (betweenness, degree, etc.) 
and properties (preferential attachment, growth, etc.) to model terrorist and terrorist 
site interactions. We have developed several clustering (e.g., blockmodeling) and 
projection (e.g., multidimensional scaling, spring embedder) techniques to visualize 
their relationships. Our focus is on understanding “Dark Networks” (unlike tradi- 
tional “bright” scholarship, e-mail, or computer networks) and their unique proper- 
ties (e.g., hiding, justice intervention, rival competition, etc.). 
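
The two topological metrics named above, degree and betweenness, can be computed with a short breadth-first routine. The sketch below is illustrative only and is not the AI Lab's implementation: the graph is a hypothetical adjacency-list dictionary, and betweenness is computed with Brandes' algorithm for unweighted, undirected graphs, without normalization.

```python
from collections import deque


def degree(graph):
    """Number of direct neighbors of each node."""
    return {node: len(neighbors) for node, neighbors in graph.items()}


def betweenness(graph):
    """Unnormalized betweenness centrality (Brandes' algorithm, unweighted graph)."""
    score = {v: 0.0 for v in graph}
    for source in graph:
        order = []                        # nodes in order of increasing distance
        preds = {v: [] for v in graph}    # predecessors on shortest paths
        sigma = {v: 0 for v in graph}     # number of shortest paths from source
        dist = {v: -1 for v in graph}
        sigma[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:           # first time w is reached
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):         # accumulate from the farthest nodes back
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each undirected pair is visited from both endpoints, so halve the totals.
    return {v: s / 2 for v, s in score.items()}


# Hypothetical toy network: two triangles joined by the single link c-d.
graph = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
    "d": ["c", "e", "f"], "e": ["d", "f"], "f": ["d", "e"],
}
```

In this toy graph the two nodes on the joining link receive the highest betweenness, which is one way nodes that bridge otherwise separate communities can be surfaced.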
2.2.2 Content Analysis

We have developed several detailed (terrorism-specific) coding schemes to analyze 
the contents of terrorist and extremist web sites. Content categories include: recruit- 
ing, training, sharing ideology, communication, propaganda, etc. We have also 
developed computer programs to help automatically identify selected content cate- 
gories (e.g., web master information, forum availability, etc.). 



2.2.3 Web Metrics Analysis 

Web metrics analysis examines the technical sophistication, media richness, and 
web interactivity of extremist and terrorist web sites. We examine technical features 
and capabilities (e.g., their ability to use forms, tables, CGI programs, multimedia 
files, etc.) of such sites to determine their level of “web-savviness.” Web metrics 
provides a measure for terrorists/extremists’ capability and resources. All terrorist 
site web metrics are extracted and computed using computer programs. 
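
As an illustration of how such technical features might be extracted automatically, the following sketch counts a handful of HTML elements in a page using Python's standard html.parser module. The list of tracked tags is an assumption chosen for the example; it is not the attribute set used in the Dark Web project.

```python
from html.parser import HTMLParser

# Hypothetical selection of tags standing in for "technical features".
TRACKED_TAGS = ("form", "table", "img", "video", "audio", "embed", "object", "script")


class FeatureCounter(HTMLParser):
    """Counts opening tags of interest while the page is parsed."""

    def __init__(self):
        super().__init__()
        self.counts = {tag: 0 for tag in TRACKED_TAGS}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag in self.counts:
            self.counts[tag] += 1


def web_metrics(html):
    """Return a dictionary mapping each tracked tag to its number of occurrences."""
    parser = FeatureCounter()
    parser.feed(html)
    parser.close()
    return parser.counts
```

Raw counts like these would normally be aggregated per site and weighted before any comparison between sites; that scoring step is outside this sketch.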



2.2.4 Sentiment and Affect Analysis 

Not all sites are equally radical or violent. Sentiment (polarity: positive/negative) 
and affect (emotion: violence, racism, anger, etc.) analysis allows us to identify 
radical and violent sites that warrant further study. We also examine how radical 
ideas become “infectious” based on their contents and senders and their interactions.
We rely heavily on recent advances in opinion mining (analyzing opinions in
short web-based texts). We have also developed selected visualization techniques to
examine sentiment/affect changes in time and among people. Our research includes 
several probabilistic multilingual affect lexicons and selected dimension reduction 
and projection (e.g., principal component analysis) techniques. 
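
A lexicon-based affect score can be sketched in a few lines. The example below is a simplified illustration under stated assumptions: the three-entry lexicon and its weights are invented for the example, and averaging the weights over all words is only one possible scoring rule. It does not reproduce the probabilistic multilingual lexicons mentioned above.

```python
import re

# Hypothetical toy lexicon: term -> {affect class: weight between 0 and 1}.
TOY_LEXICON = {
    "attack": {"violence": 0.9, "anger": 0.4},
    "hate": {"anger": 0.8},
    "destroy": {"violence": 0.7},
}


def affect_intensity(message, lexicon=TOY_LEXICON):
    """Average lexicon weight per word, reported separately for each affect class."""
    words = re.findall(r"\w+", message.lower())
    totals = {}
    for word in words:
        for affect, weight in lexicon.get(word, {}).items():
            totals[affect] = totals.get(affect, 0.0) + weight
    n_words = len(words) or 1  # avoid dividing by zero on empty messages
    return {affect: total / n_words for affect, total in totals.items()}
```

Messages with no lexicon terms return an empty dictionary, so they can be told apart from messages that score low but nonzero.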



2.2.5 Authorship Analysis and Writeprint 

Grounded in authorship analysis research, we have developed the (cyber) Writeprint 
technique to uniquely identify anonymous senders based on the signatures associ- 
ated with their forum messages. We expand the lexical and syntactic features of 
traditional authorship analysis to include system (e.g., font size, color, web links) 
and semantic (e.g., violence, racism) features of relevance to online texts of extremists
and terrorists. We have also developed advanced Ink Blot and Writeprint visualizations
to help visually identify web signatures. Our Writeprint technique has been
developed for Arabic, English, and Chinese languages. The Arabic Writeprint con- 
sists of more than 400 features, all automatically extracted from online messages 
using computer programs. Writeprint can achieve an accuracy level of 95%. 
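
To make the idea of automatically extracted writing-style features concrete, the following sketch computes a few lexical features from a single message. It is not the Writeprint technique: the feature names, the five-word function-word list, and the English-only tokenizer are assumptions made for the example, and a real feature set would be far larger.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny function-word list (English only).
FUNCTION_WORDS = ("the", "of", "and", "to", "in")


def style_features(message):
    """Return a small dictionary of content-free style features for one message."""
    words = re.findall(r"[A-Za-z']+", message.lower())
    n_words = len(words) or 1    # avoid dividing by zero on empty input
    n_chars = len(message) or 1
    counts = Counter(words)
    features = {
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "vocabulary_richness": len(counts) / n_words,
        "digit_ratio": sum(c.isdigit() for c in message) / n_chars,
        "punctuation_ratio": sum(c in ".,;:!?" for c in message) / n_chars,
    }
    for word in FUNCTION_WORDS:
        features["fw_" + word] = counts[word] / n_words
    return features
```

Feature vectors of this kind, computed per message and per author, are the usual input to a classifier that attributes an anonymous message to its most likely author.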
2.2.6 Video Analysis

Based on previous terrorism ontology research, we have developed a unique coding 
scheme to analyze terrorist-generated videos based on the contents, production 
characteristics, and metadata associated with the videos. We have also developed a 
semiautomated tool to allow human analysts to quickly and accurately analyze and 
code these videos. 



2.2.7 IEDs in Dark Web Analysis 

We have conducted several systematic studies to identify IED-related content generated
by terrorist and insurgency groups in the Dark Web. A small number of
sites are responsible for distributing a large percentage of IED-related web pages, 
forum postings, training materials, explosive videos, etc. We have developed unique 
signatures for those IED sites based on their contents, linkages, and multimedia file
characteristics. Much of the content needs to be analyzed by military analysts. 
Training materials also need to be developed for troops before their deployment 
(“seeing the battlefield from your enemies’ eyes”). 

2.2.8 Dark Web Forum Portal 

For several years, we have monitored and collected many international jihadist 
forums. These online discussion sites are dedicated to topics relating primarily to 
Islamic ideology and theology. The Lab now provides search access to these forums 
through its Dark Web Forum Portal, and in its beta form, the portal provides access 
to 29 forums, which together comprise nearly 13,000,000 messages from 340,000 
participants in four different languages (English, Arabic, German, and Russian). 
The Portal also provides statistical analysis, download, translation, and social 
network visualization functions for each selected forum. It is accessible at 
http://128.196.40.222:8080/CRI_Indexed_new/index.jsp. 



3 Dark Web Project Structure and Resources 

The Dark Web project is a multiyear, multiinstitutional, and multidisciplinary 
research program that spans information systems, computer science, terrorism 
study, intelligence analysis, international relations, etc. Many AI Lab researchers 
and students have helped contribute to the success of the project. 
3.1 Team Members (Selected)

Dr. Hsinchun Chen is project PI and director of the Artificial Intelligence Lab. He is 
in charge of all aspects of the Dark Web project. Ms. Cathy Larson is AI Lab associate 
director and a Dark Web project lead, in charge of project coordination, partnership, 
and user studies. Selected Dark Web research team members and their expertise 
include (in alphabetical order): 

• Dr. Ahmed Abbasi, affect analysis and visualization 

• Enrique Arevelo, system development 

• Alfonso A. Bonillas, system development and Spanish content collection 

• Yida Chen, social network analysis 

• Dr. Wingyan Chung, web portal development and evaluation 

• Oscar de Ita, system development and Spanish content collection 

• Carrie Fang, database development 

• Tianjun Fu, forum spidering and coherence analysis 

• Dr. Sidd Kaza, network analysis 

• Dr. Danning Hu, dynamic network analysis 

• Dr. Guanpi Lai (Greg), web portal development 

• Dr. Jiexun Jason Li, English and Chinese authorship analysis 

• Dr. Dan McDonald, computational linguistic analysis 

• Dr. Jialun Qin, web attribute system research 

• Dr. Edna Reid, terrorism knowledge mapping 

• Arab Salim, video analysis and Arabic language processing 

• Lu Tseng, system lead 

• Shing Ka Wu, system interface design 

• Wei Xi, system development 

• Dr. Jennifer Jie Xu, Dark Network analysis and visualization 

• Lijun Yan, system development 

• Dr. Rong Zheng, English and Chinese authorship analysis 

• Dr. Yilu Zhou, computational linguistic analysis 



3.2 Press Coverage and Interest 

Dark Web research has been featured in many national, international, and local 
press and media, including: Associated Press, USA Today, The Economist, NSF 
Press, Washington Post, Fox News, BBC, PBS, Business Week, National Public 
Radio, Science News, Discover Magazine, WIRED Magazine, Government 
Computing Week, Second German TV (ZDF), Toronto Star, Bulletin (Australian), 
Australian Broadcasting Corporation, Arizona Daily Star, East Valley Tribune, 
Phoenix ABC Channel 15, and Tucson TV Channels 4, 6, and 9, among others. It has 
been considered a model of advanced computational research of significant societal 
and international relevance. 

As an NSF-funded research project, our research team has generated significant 
findings and publications in major computer science and information systems 
journals and conferences. However, we have taken great care not to reveal sensitive 
group information or technical implementation details. 

We also wish to make some comments regarding civil liberties and human rights, 
based on concerned comments made by readers after hearing about the Dark Web 
project in the press. A few readers and reporters have cautioned about potential 
misuse of Dark Web contents by government agencies and authorities. 

The Dark Web project is unlike Total Information Awareness (TIA). This is not 
a secretive government project conducted by spooks. We perform scientific, 
longitudinal, hypothesis-guided terrorism research like other terrorism researchers. 
However, we are clearly more computationally oriented than traditional 
terrorism research, which relies on sociology, communications, and policy-based 
methodologies. Our contents are open-source in nature (similar to Google’s 
contents), and our major research targets are international jihadist groups, not regular 
US citizens. Our researchers are primarily computer and information scientists from 
all over the world. We develop computer algorithms, tools, and systems. Our 
research goal is to study and understand the international extremism and terrorism 
phenomena and the associated web-enabled “social movements” in some regions 
and communities. Some people may refer to this as understanding the “root cause 
of terrorism.” It can also be considered a “soft power” approach to understanding 
the social and geopolitical landscape of the Leaderless Jihad. 



3.3 The IEEE Intelligence and Security Informatics Conference 

The Dark Web project has been frequently reported at the major IEEE ISI 
conference, held annually in the USA and internationally. The ISI conference was 
initiated by Dr. Hsinchun Chen in 2003 with initial funding support from various 
government agencies, including NSF, DHS, CIA, and DOJ. Since then, the meeting 
has been held in Tucson, San Diego, Atlanta, New Brunswick, Taipei (Taiwan), 
Dallas, and Vancouver (Canada) (see http://ai.eller.arizona.edu/news/events.asp). 
The IEEE ISI Conference has also been expanded to Asia (Pacific Asia ISI, held in 
Singapore; Chengdu, China; Taipei, Taiwan; Bangkok, Thailand; and Hyderabad, 
India) and Europe (EuroISI, held in Denmark). The conference typically draws 
100-200 participants from academia (social and computer sciences), industry, and 
government who are interested in adopting advanced IT solutions to security 
problems. More recently, a new Springer journal, Security Informatics (SI), has been 
created to publish high-quality and high-impact research in the field (see 
http://ai.eller.arizona.edu/ISI/index.asp). Dr. Hsinchun Chen is serving as editor in 
chief of Springer SI. 



3.4 Dark Web Publications 

The Dark Web team has been extremely productive in generating and sharing 
significant scientific findings and academic publications in books, proceedings, and 
journal and conference papers over the past 8 years. We list significant 
publications below so that readers can find more information about our project. Selected 
research findings are also summarized in subsequent chapters, with more details on 
approaches, findings, and references. 

Books (Monograph, Edited Volumes, and Proceedings): Most Dark Web research has 
been published in the IEEE ISI conference proceedings and selected edited volumes. 

• C. Yang, M. Chau, J. Wang, and H. Chen (Eds.), “Security Informatics,” Annals 
of Information Systems, Springer, 2010. 

• H. Chen, M. Dacier, et al. (Eds.), Proceedings of the ACM SIGKDD Workshop 
on CyberSecurity and Intelligence Informatics, Paris, France, June 2009. 

• D. Zeng, L. Khan, L. Zhou, M. Day, C. Yang, B. Thuraisingham, and H. Chen 
(Eds.), Proceedings of the 2009 IEEE International Conference on Intelligence 
and Security Informatics, ISI 2009, Dallas, Texas, June 2009. 

• H. Chen, C. Yang, M. Chau, and S. Li (Eds.), Intelligence and Security Informatics, 
Proceedings of the Pacific-Asia Workshop, PAISI 2009, Bangkok, Thailand, 
Lecture Notes in Computer Science (LNCS 5477), Springer-Verlag, 2009. 

• C. Yang, H. Chen, et al., “Intelligence and Security Informatics,” IEEE ISI 2008 
International Workshops: PAISI, PACCF, and SOCO, Taipei, Taiwan, June 2008, 
Proceedings, Lecture Notes in Computer Science (LNCS 5075), Springer-Verlag, 
2008. 

• D. Zeng, H. Chen, H. Rolka, and B. Lober (Eds.), Biosurveillance and BioSecurity, 
International Workshop, BioSecure 2008, Springer-Verlag, December 2008. 

• H. Chen and C. Yang (Eds.), Intelligence and Security Informatics: Techniques 
and Applications, Springer, 2008. 

• H. Chen, E. Reid, J. Sinai, A. Silke, and B. Ganor (Eds.), Terrorism Informatics: 
Knowledge Management and Data Mining for Homeland Security, Springer, 2008. 

• H. Chen, T. S. Raghu, R. Ramesh, A. Vinze, and D. Zeng (Eds.), Handbooks in 
Information Systems - National Security, Elsevier Scientific, 2007. 

• C. Yang, D. Zeng, M. Chau, K. Chang, Q. Yang, X. Cheng, J. Wang, F. Wang, 
and H. Chen (Eds.), Intelligence and Security Informatics, Proceedings of the 
Pacific-Asia Workshop, PAISI 2007, Lecture Notes in Computer Science (LNCS 
4430), Springer-Verlag, 2007. 

• S. Mehrotra, D. Zeng, H. Chen, B. Thuraisingham, and F. Wang (Eds.), 
Intelligence and Security Informatics, Proceedings of the IEEE International 
Conference on Intelligence and Security Informatics, ISI 2006, Lecture Notes in 
Computer Science (LNCS 3975), Springer-Verlag, 2006. 

• H. Chen, F. Wang, C. Yang, D. Zeng, M. Chau, and K. Chang (Eds.), Intelligence 
and Security Informatics, Proceedings of the Workshop on Intelligence and 
Security Informatics, WISI 2006, Lecture Notes in Computer Science (LNCS 
3917), Springer-Verlag, 2006. 

• H. Chen, Intelligence and Security Informatics for International Security: 
Information Sharing and Data Mining, Springer, 2006. 

• P. Kantor, G. Muresan, F. Roberts, D. Zeng, F. Wang, H. Chen, and R. Merkle 
(Eds.), Intelligence and Security Informatics, Proceedings of the IEEE 
International Conference on Intelligence and Security Informatics, ISI 2005, 
Lecture Notes in Computer Science (LNCS 3495), Springer-Verlag, 2005. 

• H. Chen, R. Moore, D. Zeng, and J. Leavitt (Eds.), Intelligence and Security 
Informatics, Proceedings of the Second Symposium on Intelligence and Security 
Informatics, ISI 2004, Lecture Notes in Computer Science (LNCS 3073), 
Springer-Verlag, 2004. 

• H. Chen, R. Miranda, D. Zeng, T. Madhusudan, C. Demchak, and J. Schroeder 
(Eds.), Intelligence and Security Informatics, Proceedings of the First NSF/NIJ 
Symposium on Intelligence and Security Informatics, ISI 2003, Lecture Notes in 
Computer Science (LNCS 2665), Springer-Verlag, 2003. 

Journal Articles (Published and Forthcoming): Selected Dark Web research has 
been published in major, SCI-indexed IT journals. 



2010 

• D. Zimbra, A. Abbasi, and H. Chen, “A Cyber-archeology Approach to Social 
Movement Research: Framework and Case Study,” Journal of Computer- 
Mediated Communication, forthcoming, 2010. 

• J. Xu, D. Hu, and H. Chen, “Dynamics of Terrorist Networks,” Journal of 
Homeland Security and Emergency Management, forthcoming, 2010. 

• A. Abbasi, H. Chen, and Z. Zhang, “Selecting Attributes for Sentiment 
Classification using Feature Relation Networks,” IEEE Transactions on 
Knowledge and Data Engineering, forthcoming, 2010. 

• T. J. Fu, A. Abbasi, and H. Chen, “A Focused Crawler for Dark Web Forums,” 
Journal of the American Society for Information Science and Technology, Volume 
61, Number 6, Pages 1213-1231, 2010. 



2009 

• Y. Chen, A. Abbasi, and H. Chen, “Framing Social Movement Identity with 
Cyber-Artifacts: A Case Study of the International Falun Gong Movement,” 
Annals of Information Systems, Volume 9, Pages 1-24, 2009. 

• Y. Dang, Y. Zhang, H. Chen, P. Hu, S. Brown, and C. Larson, “Arizona Literature 
Mapper: An Integrated Approach to Monitor and Analyze Global Bioterrorism 
Research Literature,” Journal of the American Society for Information Science 
and Technology, Volume 60, Number 7, Pages 1466-1485, July 2009. 

• D. Hu, S. Kaza, and H. Chen, “Identifying Significant Facilitators of Dark 
Network Evolution,” Journal of the American Society for Information Science 
and Technology, Volume 60, Number 4, Pages 655-665, April 2009. 



2008 

• A. Abbasi and H. Chen, “CyberGate: A System and Design Framework for Text 
Analysis of Computer Mediated Communication,” MIS Quarterly (MISQ), 
Special Issue on Design Science Research, Vol. 32, No. 4, Pages 811-837, 
December 2008. 

• A. Abbasi, H. Chen, S. Thoms, T. Fu, “Affect Analysis of Web Forums and Blogs 
Using Correlation Ensembles,” IEEE Transactions on Knowledge and Data 
Engineering, Volume 20, Number 9, Pages 1168-1180, September 2008. 

• H. Chen, W. Chung, J. Qin, E. Reid, M. Sageman, and G. Weinmann, “Uncovering 
the Dark Web: A Case Study of Jihad on the Web,” Journal of the American 
Society for Information Science and Technology, Volume 59, Number 8, Pages 
1347-1359, 2008. 

• A. Abbasi, H. Chen, H. A. Salem, “Sentiment Analysis in Multiple Languages: 
Feature Selection for Opinion Classification in Web Forums,” ACM Transactions 
on Information Systems, Vol. 26, No. 3, Article 12, June 2008. 

• A. Abbasi and H. Chen, “Writeprints: A Stylometric Approach to Identity-Level 
Identification and Similarity Detection in Cyberspace,” ACM Transactions on 
Information Systems, Vol. 26, No. 2, Article 7, March 2008. 



2007 

• E. Reid and H. Chen, “Mapping the Contemporary Terrorism Research Domain,” 
International Journal of Human-Computer Studies, 65, Pages 42-56, 2007. 

• J. Qin, Y. Zhou, E. Reid, G. Lai, and H. Chen, “Analyzing Terror Campaigns on 
the Internet: Technical Sophistication, Content Richness, and Web Interactivity,” 
International Journal of Human-Computer Studies, 65, Pages 71-84, 2007. 

• E. Reid and H. Chen, “Internet-Savvy U.S. and Middle Eastern Extremist Groups,” 
Mobilization: An International Quarterly, 12(2), Pages 177-192, 2007. 



Table 1.1 Funding of Dark Web research projects (selected) 

Agency and project title                                        Funding period 

Security Informatics Research (partial support)                 Sept 2003-Aug 2011 
(NSF # EIA-0326348) 

Air Force Research Lab                                          Aug 2008-May 2009 
- Dark Web WMD-Terrorism Study (Subcontract No FA8650-02) 

Defense Threat Reduction Agency                                 Jul 2009-Jul 2012 
- WMD Intent Identification and Interaction Analysis Using 
  the Dark Web (HDTRA1-09-1-0058) 

Dept. of Homeland Security/CNRI                                 Oct 2003-Sept 2005 
- Border Safe Initiative (partial support) 

Library of Congress                                             Jul 2005-Jun 2008 
- Capture of Multimedia, Multilingual Open Source Web-based 
  At-Risk Content 



Chapter 2 

Intelligence and Security Informatics (ISI) Research Framework 



1 Information Technology and National Security 

The tragic events of September 11 and the subsequent anthrax contamination of 
letters had drastic effects on many aspects of society. Terrorism became the most 
significant threat to national security because of its potential to bring massive 
damage to our infrastructure, economy, and people. 

In response to this challenge, federal authorities are actively implementing 
comprehensive strategies and measures in order to achieve the three objectives 
identified in the “National Strategy for Homeland Security” report (Office of Homeland 
Security 2002): (1) preventing future terrorist attacks, (2) reducing the nation’s 
vulnerability, and (3) minimizing the damage and recovering from attacks that 
occur. State and local law enforcement agencies, likewise, are becoming more 
vigilant about the criminal activities that harm public safety and threaten national 
security. 

Academics in the fields of natural sciences, computational science, information 
science, social sciences, engineering, medicine, and many others have been called 
upon to help enhance the government’s ability to fight terrorism and other crimes. 
Science and technology have been identified in the “National Strategy for Homeland 
Security” report as the keys to winning the new counterterrorism war (Office of 
Homeland Security 2002). It is widely believed that information technology will 
play an indispensable role in making our nation safer (National Research Council 
2002) by supporting intelligence and knowledge discovery through collecting, 
processing, analyzing, and utilizing terrorism- and crime-related data (Chen 2006). 
Based on the crime and intelligence knowledge discovered, the federal, state, and 
local authorities can make timely decisions to select effective strategies and tactics 
as well as allocate the appropriate amount of resources to detect, prevent, and 
respond to future attacks. 

Six critical mission areas have been identified where information technology can 
contribute to the accomplishment of the three strategic national security objectives 
identified in the “National Strategy for Homeland Security” report (Office of 
Homeland Security 2002): 

• Intelligence and warning. Although terrorism depends on surprise to damage its 
targets (Office of Homeland Security 2002), terrorist activities are neither random 
nor impossible to track. Terrorists must plan and prepare before the execution of 
an attack by selecting a target, recruiting and training executors, acquiring 
financial support, and traveling to the country where the target is located (Sageman 
2004). To avoid being preempted by authorities, they may hide their true 
identities and disguise attack-related activities. Similarly, criminals may use falsified 
identities during police contacts (Wang et al. 2004a). Although it is difficult, 
detecting potential terrorist attacks or crimes is feasible with the 
help of information technology. By analyzing the communication and activity 
patterns among terrorists and their contacts (i.e., terrorist networks), detecting 
deceptive identities, or employing other surveillance and monitoring techniques, 
intelligence and warning systems may issue timely, critical alerts and warnings 
to prevent attacks or crimes from occurring. 

• Border and transportation security. Terrorists enter a targeted country through an 
air, land, or sea port of entry. Criminals in narcotics rings travel across borders to 
purchase, carry, distribute, and sell drugs. Information, such as travelers’ 
identities, images, fingerprints, vehicles used, and other characteristics, is collected from 
customs, border, and immigration authorities on a daily basis. Counterterrorism 
and crime-fighting capabilities can be greatly improved by the creation of a “smart 
border,” where information from multiple sources is shared and analyzed to help 
locate wanted terrorists or criminals. Technologies such as information sharing 
and integration, collaboration and communication, biometrics, and image and 
speech recognition will be greatly needed in such smart borders. 

• Domestic counterterrorism. As terrorists, both international and domestic, may 
be involved in local crimes, state and local law enforcement agencies are also 
contributing to the missions by investigating and prosecuting crimes. Terrorism, 
like gangs and narcotics trafficking, is regarded as a type of organized crime in 
which multiple offenders cooperate to carry out offenses. Information 
technologies that help find cooperative relationships between criminals and their 
interactive patterns would also be helpful for analyzing terrorism. Monitoring activities 
of domestic terrorist and extremist groups using advanced information 
technologies will also be helpful to public safety personnel and policy makers. 

• Protecting critical infrastructure and key assets. Roads, bridges, water supplies, 
and many other physical service systems are critical infrastructure and key assets 
of a nation. They may become the target of terrorist attacks because of their 
vulnerabilities (Office of Homeland Security 2002). Moreover, virtual (cyber) 
infrastructure such as the Internet may also be vulnerable to intrusions and inside 
threats. Criminals and terrorists are increasingly using cyberspace to conduct 
illegal activities, share ideology, solicit funding, and recruit. In addition to 
physical devices such as sensors and detectors, advanced information technologies are 
needed to model the normal behaviors of the usage of these systems and then use 
the models to distinguish abnormal behaviors from normal behaviors. Protective 
or reactive measures can be selected based on the results to secure these assets 
from attacks. 

• Defending against catastrophic terrorism. Terrorist attacks can cause devastating 
damage to a society through the use of chemical, biological, or nuclear weapons. 
Biological attacks, for example, may cause contamination, infectious disease 
outbreaks, and significant loss of life. Information systems that can efficiently 
and effectively collect, access, analyze, and report data about catastrophe-leading 
events can help prevent, detect, respond to, and manage these attacks. 

• Emergency preparedness and responses. In case of a national emergency, prompt 
and effective responses are critical to reduce the damage resulting from an attack. 
In addition to the systems that are prepared to defend against catastrophes, 
information technologies that help design and experiment with optimized response 
plans, identify experts, train response professionals, and manage consequences 
are beneficial in the long run. Moreover, information systems that facilitate social 
and psychological support to the victims of terrorist attacks can also help society 
recover from disasters. 
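
The critical infrastructure mission area above describes modeling the normal usage of a 
system and then using the model to separate abnormal behavior from normal behavior. 
A minimal sketch of that idea follows. It assumes hypothetical hourly request counts and 
a simple z-score threshold; neither the data nor the threshold comes from the source, and 
an operational system would likely rely on much richer behavioral models. 

```python
from statistics import mean, stdev


def fit_baseline(normal_counts):
    """Summarize normal usage as a (mean, sample standard deviation) pair."""
    return mean(normal_counts), stdev(normal_counts)


def flag_abnormal(observations, baseline, threshold=3.0):
    """Return observations lying more than `threshold` standard deviations
    from the baseline mean. Returns an empty list if the baseline has no spread."""
    mu, sigma = baseline
    if sigma == 0:
        return []
    return [x for x in observations if abs(x - mu) / sigma > threshold]


# Hypothetical hourly request counts recorded during normal operation.
normal_usage = [100, 104, 98, 101, 97, 103, 99, 102]
baseline = fit_baseline(normal_usage)

# New observations; one value lies far outside the normal range.
new_observations = [101, 96, 250, 105]
abnormal = flag_abnormal(new_observations, baseline)
print(abnormal)
```

The threshold of three standard deviations is an illustrative assumption; in practice it 
would be tuned to balance missed events against false alarms. 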

Although it is important for the critical missions of national security, the 
development of information technology for counterterrorism and crime-fighting 
applications faces many problems and challenges. 



1.1 Problems and Challenges 

Currently, intelligence and security agencies are gathering large amounts of data 
from various sources. Processing and analyzing such data, however, has become 
increasingly difficult. By treating terrorism as a form of organized crime, these 
challenges can be categorized into three types: 

• Characteristics of criminals and crimes. Some crimes may be geographically 
diffused and temporally dispersed. In organized crimes such as transnational 
narcotics trafficking, criminals often live in different countries, states, and cities. 
Drug distribution and sales occur in different places at different times. Similar 
situations exist in other organized crimes (e.g., terrorism, armed robbery, and 
gang-related crime). As a result, an investigation must cover multiple offenders 
who commit criminal activities in different places at different times. This can be 
fairly difficult given the limited resources that intelligence and security agencies 
have. Moreover, as computer and Internet technologies advance, criminals are 
utilizing cyberspace to commit various types of cybercrimes under the guise of 
ordinary online transactions and communications. 

• Characteristics of security- and intelligence-related data. A significant source of 
challenge is the information stovepipe and overload resulting from diverse data 
sources, multiple data formats, and large data volumes. Unlike other domains 
such as marketing, finance, and medicine in which data can be collected from 
particular sources (e.g., sales records from companies, patient medical history 
from hospitals), the intelligence and security domain does not have a well-defined 
data source. Both authoritative information (e.g., crime incident reports, 
telephone records, financial statements, immigration and customs records) and open 
source information (e.g., news stories, journal articles, books, web pages) need 
to be gathered for investigative purposes. Data collected from these different 
sources often are in different formats ranging from structured database records to 
unstructured text, image, audio, and video files. Important information such as 
criminal associations may be available but contained in unstructured, 
multilingual texts and remains difficult to access and retrieve. Moreover, as data volumes 
continue to grow, extracting valuable and credible intelligence and knowledge 
becomes a difficult problem. 

• Characteristics of security and intelligence analysis techniques. Current research 
on the technologies for counterterrorism and crime-fighting applications lacks a 
consistent framework addressing the major challenges. Some information 
technologies, including data integration, data analysis, text mining, image and video 
processing, and evidence combination, have been identified as being particularly 
helpful (National Research Council 2002). However, the question of how to 
employ them in the intelligence and security domain and use them to effectively 
address the critical mission areas of national security remains unanswered. 

Facing the critical missions of national security and various data and technical 
challenges, we believe there is a pressing need to develop the science of 
“Intelligence and Security Informatics” (ISI) (Chen 2006), with its main objective 
being the “development of advanced information technologies, systems, algorithms, 
and databases for national security-related applications, through an integrated 
technological, organizational, and policy-based approach.” 



1.2 Intelligence and Security Informatics Versus Biomedical Informatics: Emergence of a Discipline 

Comparing ISI with biomedical informatics, an established academic discipline 
addressing information management issues in biological and medical applications 
(Shortliffe and Blois 2000; Chen et al. 2005), we found strong analogies 
between these two disciplines. Table 2.1 summarizes the similarities and differences 
between ISI and biomedical informatics. 

In terms of data characteristics, they both face the information stovepipe and 
information overload problem. In terms of technology development, they both are 
searching for new approaches, methods, and innovative use of existing techniques. 
In terms of scientific contributions, they both may add new insights and knowledge 
to various academic disciplines. 

Table 2.1 Analogies between ISI and biomedical informatics 

Challenges: Domain-specific 
Biomedical informatics: 
• Complexity and uncertainty associated with organisms and diseases 
• Critical decisions regarding patient well-being and biomedical discoveries 
ISI: 
• Geographically diffused and temporally dispersed organized crimes 
• Cybercrimes on the Internet 
• Critical decisions related to public safety and homeland security 

Challenges: Data 
Biomedical informatics: 
• Information stovepipe and overload 
• HL7 XML standard 
• PHIN MS messaging 
• Patient records, disease data, medical images 
ISI: 
• Information stovepipe and overload 
• Justice XML standard 
• Criminal incident records 
• Multilingual intelligence open sources 

Challenges: Technology 
Biomedical informatics: 
• Ontologies and linguistic parsing 
• Information integration 
• Data and text mining 
• Medical decision support systems and techniques 
ISI: 
• Information integration 
• Criminal network analysis 
• Data, text, and web mining 
• Identity management and deception detection 

Methodology 
Biomedical informatics: 
• KDD 
ISI: 
• KDD 

Contributions: Scientific 
Biomedical informatics: 
• Computer and information science, sociology, policy, legal 
• Clinical medicine and biology 
• Public health 
ISI: 
• Computer and information science, sociology, policy, legal 
• Criminology, terrorism research 

Contributions: Practical 
Biomedical informatics: 
• Patient well-being 
• Biomedical treatment and discovery 
ISI: 
• Crime investigation and counterterrorism 
• National and homeland security 



Most importantly, as a consistent research framework based on knowledge 
management and data mining has begun to emerge in biomedical informatics (Chen 
et al. 2005), ISI also needs a framework to guide its research. Facing the unique 
challenges (and associated opportunities) of information overload and the pressing 
need for advanced criminal and intelligence analyses and investigations, we believe 
that the Knowledge Discovery from Databases (KDD) methodology (Fayyad and 
Uthurusamy 2002), which has achieved significant success in other 
information-intensive, knowledge-critical domains including business, engineering, biology, and 
medicine, could be critical in addressing the challenges and problems facing ISI. 



1.3 Research Opportunities 

The emergence of a new discipline such as ISI would require careful cultivation and 
development by many top-notch researchers and practitioners from many different 
disciplines, including (but not limited to) computer science, information science, 
information systems, electrical engineering, social science, law, public policy, 
criminal justice, terrorism research, psychology, behavioral and economic sciences, 
management science, bioinformatics, and public health. There is an abundance of 
opportunities for developing new and innovative funded ISI-related projects. 

Regardless of which funding programs you may be considering for your research, 
there are some common characteristics among successfully funded and (eventually, 
after execution) high-impact projects: 

• Unique and critical scientific or engineering innovations: You need to clearly 
distinguish your research from others. 

• Important problems and significant partners: You need to address important 
national security problems and demonstrate your commitment to address these 
problems with the help and support of your local, state, and federal agency 
partners. 

• From small to large: Most funded projects began humbly with proof-of-concept 
level funding. 

• A multidisciplinary team: After initial success, a multidisciplinary team of 
computer scientists, system developers, social scientists, policy and legal experts, 
domain (intelligence and security) experts, and others will be needed to implement 
a full-scale, multiphased, complex national security-related project. 

• Aim high: The following tangible (but somewhat lofty) project goals are always 
good for your project team to aim at: (1) publishing your project findings in 
Science, Nature, or Proceedings of the National Academy of Sciences (for its scientific 
contributions) and (2) being featured in a New York Times or USA Today 
front-page article (for its societal impact). 



2 ISI Research Framework 

Crime is an act or the commission of an act that is forbidden, or the omission of a 
duty that is commanded by a public law and that makes the offender liable to 
punishment by that law. The greater the threat a crime type poses to public safety, the more 
likely it is to be of national security concern. Some crimes such as traffic violations, 
theft, and homicide are mainly in the jurisdiction of local law enforcement agencies. 
Some other crimes need to be dealt with by both local law enforcement and national 
security authorities. Identity theft and fraud, for instance, are relevant at both the 
local and national level - criminals may escape arrest by using false identities; drug 
smugglers may enter the United States by holding counterfeit passports or visas. 

Table 2.2 Crime types and security concerns 

Crime type: Traffic violations 
Local law enforcement level: Driving under influence (DUI), fatal/personal 
injury/property damage, traffic accident, road rage 
National security level: - 

Crime type: Sex crime 
Local law enforcement level: Sexual offenses, sexual assaults, child molesting 
National security level: Organized prostitution, people smuggling 

Crime type: Theft 
Local law enforcement level: Robbery, burglary, larceny, motor vehicle theft, 
stolen property 
National security level: Theft of national secrets or weapon information 

Crime type: Fraud 
Local law enforcement level: Forgery and counterfeiting, fraud, embezzlement, 
identity deception 
National security level: Transnational money laundering, identity fraud, 
transnational financial fraud 

Crime type: Arson 
Local law enforcement level: Arson on buildings, apartments 
National security level: - 

Crime type: Organized crime 
Local law enforcement level: Narcotic drug offenses (sales or possession), 
gang-related offenses 
National security level: Transnational drug trafficking, terrorism (bioterrorism, 
bombing, hijacking, etc.) 

Crime type: Violent crime 
Local law enforcement level: Criminal homicide, armed robbery, aggravated 
assault, other assaults 
National security level: Terrorism 

Crime type: Cybercrime 
Both levels: Internet fraud (e.g., credit card fraud, advance fee fraud, 
fraudulent web sites), illegal trading, network intrusion/hacking, virus 
spreading, hate crimes, cyber-piracy, cyber-pornography, cyber-terrorism, 
theft of confidential information 



Organized crimes, such as terrorism and narcotics trafficking, are often diffuse 
geographically, resulting in common security concerns across cities, states, and 
countries. Cybercrimes can pose threats to public safety across multiple 
jurisdictional areas due to the widespread nature of computer networks. 

Table 2.2 summarizes the different types of crimes sorted by the degree of their 
respective public influence. International and domestic terrorism, in particular, often 
involves multiple crime types (e.g., identity theft, money laundering, arson and 
bombing, organized and violent activities, and cyber-terrorism) and causes great damage. 

Fig. 2.1 A knowledge discovery research framework for ISI 

We believe that KDD techniques can play a central role in improving the 
counterterrorism and crime-fighting capabilities of intelligence, security, and law enforcement 
agencies by reducing cognitive and information overload. Knowledge discovery 
refers to the nontrivial extraction of implicit, previously unknown, and potentially 
useful knowledge from data. Knowledge discovery techniques promise easy, 
convenient, and practical exploration of very large collections of data for organizations 
and users and have been applied in marketing, finance, manufacturing, biology, and 
many other domains (e.g., predicting consumer behaviors, detecting credit card 
fraud, or clustering genes that have similar biological functions) (Fayyad and 
Uthurusamy 2002). Traditional knowledge discovery techniques include 
association rule mining, classification and prediction, cluster analysis, and outlier 
analysis. As natural language processing (NLP) research advances, (multilingual) text 
mining approaches that automatically extract, summarize, categorize, and translate 
text documents have also been widely used (Chen 2006). 

Many of these KDD technologies could be applied in ISI studies. Keeping in mind 
the special characteristics of crimes, criminals, and security-related data, we catego- 
rize existing ISI technologies into six classes: information sharing and collabora- 
tion, crime association mining, crime classification and clustering, intelligence text 
mining, spatial and temporal crime mining, and criminal network mining. 
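
As a minimal sketch of what the crime association mining class involves, the following Python fragment computes support and confidence for pairs of attributes that co-occur in incident records. The records, attribute names, thresholds, and the function name association_rules are all hypothetical and chosen only for illustration; a practical system would likely use a full frequent-itemset algorithm over properly authorized data.

```python
from collections import Counter
from itertools import combinations

# Hypothetical incident records: each set holds attributes noted in one report.
incidents = [
    {"narcotics", "firearm", "gang"},
    {"narcotics", "gang"},
    {"fraud", "identity_theft"},
    {"narcotics", "firearm"},
    {"fraud", "identity_theft", "money_laundering"},
]


def association_rules(records, min_support=0.4, min_confidence=0.6):
    """Return sorted (antecedent, consequent, support, confidence) for item pairs."""
    n = len(records)
    item_counts = Counter(item for rec in records for item in rec)
    pair_counts = Counter(
        pair for rec in records for pair in combinations(sorted(rec), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n              # share of records containing both items
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]   # estimate of P(rhs | lhs)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


for lhs, rhs, sup, conf in association_rules(incidents):
    print(f"{lhs} -> {rhs}  support={sup:.2f}  confidence={conf:.2f}")
```

Restricting the sketch to item pairs keeps it short; longer rules would require enumerating larger itemsets and pruning by support.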

These six classes are grounded on traditional knowledge discovery technologies 
with a few new approaches added, including spatial and temporal crime pattern 
mining and criminal network analysis, which are more relevant to counterterrorism 
and crime investigation. Although information sharing and collaboration are not 
data mining per se, they help prepare, normalize, warehouse, and integrate data for 
knowledge discovery and thus are included in the framework. 

In Fig. 2.1, we present our proposed research framework, with the horizontal 
axis being the crime types and the vertical axis being the six classes of techniques 
(Chen 2006). The shaded regions on the chart show promising research areas, i.e., 
that a certain class of techniques is relevant to solving a certain type of crime. Note 
that more serious crimes may require a more complete set of knowledge discovery 
techniques. For example, the investigation of organized crimes such as terrorism 
may depend on criminal network analysis technology, which requires the use of 
other knowledge discovery techniques such as association mining and clustering. 
An important observation about this framework is that the high-frequency occur- 
rences and strong association patterns of severe and organized crimes such as terror- 
ism and narcotics present a unique opportunity and potentially high rewards for 
adopting such a knowledge discovery framework. 

Several unique classes of data mining techniques are of great relevance to ISI 
research. Text mining is critical for extracting key entities (people, places, narcotics, 
weapons, time, etc.) and their relationships presented in voluminous police incident 
reports, intelligence reports, open source news clips, etc. Some of these techniques 
need to be multilingual in nature, including the abilities for machine translation and 
crosslingual information retrieval (CLIR). Spatial and temporal mining and visual- 
ization are often needed for geographic information systems (GIS) and temporal 
analysis of criminal and terrorist events. Most crime analysts are well trained in 
GIS-based crime mapping tools; however, automated spatial and temporal pattern 
mining techniques (e.g., hotspot analysis) have not been adopted widely in intelli- 
gence and security applications. 
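
To illustrate what an automated hotspot analysis can look like in its simplest form, the sketch below bins incident coordinates into square grid cells and reports the cells whose counts reach a threshold. This is only one basic approach, not a method described in this chapter; the coordinates, cell size, threshold, and the function name grid_hotspots are assumptions made for the example.

```python
from collections import Counter


def grid_hotspots(points, cell_size=1.0, min_count=3):
    """Bin (x, y) incident coordinates into square cells and return dense cells."""
    counts = Counter(
        (int(x // cell_size), int(y // cell_size)) for x, y in points
    )
    return {cell: n for cell, n in counts.items() if n >= min_count}


# Hypothetical incident coordinates in arbitrary map units (not real data).
incidents_xy = [
    (0.2, 0.3), (0.7, 0.9), (0.5, 0.1), (0.9, 0.4),   # cluster in cell (0, 0)
    (3.1, 3.2), (3.8, 3.9), (3.5, 3.5),               # cluster in cell (3, 3)
    (7.0, 1.0), (5.5, 6.2),                           # isolated events
]

print(grid_hotspots(incidents_xy))
```

A fixed grid is sensitive to the choice of cell size and cell boundaries, which is one reason density-based or statistical hotspot methods are often preferred in practice.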

Organized criminals (e.g., gangs and narcotics) and terrorists often form inter- 
connected covert networks for their illegal activities. Often referred to as “dark 
networks,” these organizations exhibit unique structures, communication chan- 
nels, and resilience to attack and disruption. New computational techniques, 
including social network analysis, network learning, and network topological 
analysis (e.g., random network, small-world network, and scale-free network), are 
needed for the systematic study of those complex and covert networks. We broadly 
consider these techniques under criminal network analysis in Fig. 2.1. 
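
The topological measures mentioned above can be made concrete with a small sketch. The fragment below computes node degree, the local clustering coefficient, and the average shortest-path length for a toy undirected network; high clustering combined with short paths is the usual signature of a small-world network. The network, node labels, and function names are hypothetical and serve only as an illustration.

```python
from collections import deque

# Hypothetical undirected network: node -> set of neighbors.
network = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A", "E"},
    "E": {"D"},
}


def degree(graph):
    """Number of direct links per node."""
    return {node: len(nbrs) for node, nbrs in graph.items()}


def clustering_coefficient(graph, node):
    """Fraction of a node's neighbor pairs that are themselves linked."""
    nbrs = list(graph[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if nbrs[j] in graph[nbrs[i]]
    )
    return 2 * links / (k * (k - 1))


def average_path_length(graph):
    """Mean shortest-path length over all reachable ordered node pairs (BFS)."""
    total, pairs = 0, 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


print(degree(network))
print({n: round(clustering_coefficient(network, n), 2) for n in network})
print(average_path_length(network))
```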



2.1 Caveats for Data Mining 

Before we review in detail relevant ISI-related data mining techniques, applica- 
tions, and literature in the next chapter, we wish to briefly discuss the legal and ethi- 
cal caveats regarding crime and intelligence research. 

The potential negative effects of intelligence gathering and analysis on the pri- 
vacy and civil liberties of the public have been well publicized (Cook and Cook 
2003). There exist many laws, regulations, and agreements governing data collec- 
tion, confidentiality, and reporting, which could directly impact the development 
and application of ISI technologies. We strongly recommend that intelligence and 
security agencies and ISI researchers be aware of these laws and regulations in 
research. Moreover, we also suggest that a hypothesis-guided, evidence-based 
approach be used in crime and intelligence analysis research. That is, there should 
be probable and reasonable causes and evidence for targeting particular individuals 
or datasets for analysis. Proper investigative and legal procedures need to be strictly 
followed. It is neither ethical nor legal to “fish” for potential criminals from diverse 
and mixed crime-, intelligence-, and citizen-related data sources. The well-publicized 
Defense Advanced Research Projects Agency (DARPA) Total Information Awareness 
(TIA) Program and the Multistate Anti-Terrorism Information Exchange (MATRIX) 
system, for example, have recently been shut down due to their potential misuse of 
citizen data and impairment of civil liberties (American Civil Liberties Union 2004; 
O’Harrow 2005). 



2.2 Domestic Security, Civil Liberties, and Knowledge Discovery 

In an important recent review article, Strickland, Baldwin, and Justsen (Strickland 
et al. 2005) provide an excellent historical account of government surveillance in 
the United States. The article presents new surveillance initiatives in the age of 
terrorism (including the passage of the USA Patriot Act), discusses in great depth 
the impact of technology on surveillance and citizens’ rights, and proposes a 
balance between needed secrecy and oversight. We believe this is one of the most 
comprehensive articles addressing civil liberties issues in the context of national 
security research. We summarize some of the key points made in the article in the 
context of our proposed ISI research. 

Framed in the context of domestic security surveillance, the paper considers 
surveillance an important intelligence tool that has the potential to contribute 
significantly to national security but also to infringe on civil liberties. As faculty 
of the University of Maryland Information Science Department, the authors believe 
that information science and technology have drastically expanded the mechanisms 
by which data can be collected and knowledge extracted and disseminated through 
automated means. 

An immediate result of the tragic events of September 11, 2001, was the extraor- 
dinarily rapid passage of the USA Patriot Act in late 2001. The legislation was 
passed by the Senate on October 11, 2001, by the House on October 24, 2001, and 
signed by the President on October 26, 2001. The continuing legacy of the then- 
existing consensus and the lack of detailed debate and considerations created a bit- 
ter ongoing national argument as to the proper balance between national security 
and civil liberties. The Patriot Act contains ten titles in 131 pages. It amends 
numerous laws: for example, it expands electronic surveillance of communications 
in law enforcement cases, authorizes the sharing of law enforcement data with 
intelligence agencies, expands the acquisition of electronic communications as well 
as commercial records for intelligence use, and creates new terrorism-related 
crimes. 

However, as new data mining and knowledge discovery techniques mature and become 
potentially useful for national security applications, there are great concerns 
about violating civil liberties. Both DARPA’s TIA Program and the Transportation 
Security Administration’s (TSA) Computer Assisted Passenger Prescreening System 
(CAPPS II) were cited as failed systems that faced significant media scrutiny and 
public opposition. Both systems were based on extensive data mining of commercial 
and government databases collected for one purpose and then shared and used for 
another, and both were sidetracked by a widely perceived threat to personal 
privacy. Based on much of the debate generated by these programs, the authors 
suggest that data mining using public or private sector databases for national 
security purposes must proceed in two stages: first, the search for general 
information must ensure anonymity; second, the acquisition of specific identity, 
if required, must be by court order under appropriate standards 
(e.g., in terms of “special needs” or “probable causes”). 
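
One way to picture the two-stage process is the sketch below, which is our own illustration rather than a design proposed by Strickland et al. In stage one, records carry only keyed-hash pseudonyms, so pattern analysis can proceed without names. In stage two, a separate escrow component releases an identity only after an authorization has been recorded. The key, class name, method names, and sample record are all hypothetical, and a real deployment would need far stronger legal and technical safeguards.

```python
import hashlib
import hmac

# Hypothetical secret held by an oversight body, not by the analysts.
SECRET_KEY = b"hypothetical-key-held-by-an-oversight-body"


def pseudonymize(identity: str) -> str:
    """Stage 1: replace an identity with a keyed hash so analysts see no names."""
    digest = hmac.new(SECRET_KEY, identity.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


class IdentityEscrow:
    """Stage 2: holds the pseudonym-to-identity map; releases only with an order."""

    def __init__(self):
        self._identities = {}
        self._approved = set()

    def register(self, identity: str) -> str:
        token = pseudonymize(identity)
        self._identities[token] = identity
        return token

    def record_court_order(self, token: str) -> None:
        # Stands in for whatever legal authorization procedure applies.
        self._approved.add(token)

    def reveal(self, token: str) -> str:
        if token not in self._approved:
            raise PermissionError("no court order on file for this pseudonym")
        return self._identities[token]


escrow = IdentityEscrow()
token = escrow.register("Jane Doe")  # hypothetical name
anonymous_record = {"subject": token, "event": "wire transfer"}
print(anonymous_record)
```

Keeping the escrow separate from the analysis data means that a mistake or abuse on the analysis side does not by itself expose identities.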

In their concluding remarks, the authors cautioned that secrecy in any organization 
could pose a real risk of abuse and must be constrained through effective checks and 
balances. Moreover, information science and technology professionals are ideally 
situated to provide the tools and techniques by which the necessary intelligence is 
collected, analyzed, and disseminated, while civil liberties are protected through 
established laws and policies. 

In addition to the review article by Strickland et al., readers are also referred to 
an excellent book entitled “No Place to Hide,” written by Washington Post reporter 
Robert O’Harrow (2005). He reveals how the government is creating a national 
intelligence infrastructure with the help of private information, security, and tech- 
nology companies. The book examines in detail the potential impact of this new 
national security system on our traditional notions of civil liberties, autonomy, and 
privacy. 



2.3 Research Opportunities 

National security research poses unique challenges and opportunities. Much of the 
established data mining and knowledge discovery literature, findings, and tech- 
niques need to be reexamined in light of the unique data and problem characteristics 
in the law enforcement and intelligence community. New text mining, spatial and 
temporal pattern mining, and criminal network analysis of relevance to national 
security are among some of the most pressing research areas. However, researchers 
cannot conduct research in a vacuum. Partnerships with local, state, and federal 
agencies need to be formed to obtain relevant test data and necessary domain exper- 
tise for ISI research. Only after rigorous testing with scrubbed or anonymous data 
can select techniques be field examined and verified by domain experts (i.e., law 
enforcement personnel, intelligence analysts, and policy makers). These techniques 
should be used in actual investigations only after experts have confirmed their 
potential value. At this stage, the researcher-designed algorithms or systems are 
often much improved and refined, and are often operated and controlled by the 
domain experts with their own heuristics, know-how, and judgment. 



Chapter 3 

Terrorism Informatics 



1 Introduction 

Terrorism informatics is defined as the “application of advanced methodologies and 
information fusion and analysis techniques to acquire, integrate, process, analyze, 
and manage the diversity of terrorism-related information for national/international 
and homeland security-related applications” (Chen et al. 2008). These techniques 
are derived from disciplines such as computer science, informatics, statistics, math- 
ematics, linguistics, social sciences, and public policy. Because the study of terror- 
ism involves copious amounts of information from multiple sources, data types, and 
languages, information fusion and analysis techniques such as data mining, text 
mining, web mining, data integration, language translation technologies, and image 
and video processing are playing key roles in the future prevention, detection, and 
remediation of terrorism. Although there has been substantial investment and 
research in the application of computer technology to terrorism research, much of 
the literature in this emerging area is fragmented and often narrowly focused within 
specific domains. There is a critical need to develop a multidisciplinary approach to 
answering important terrorism-related research questions. 



2 Terrorism and the Internet 

Terrorism is the systematic use of terror especially as a means of coercion. At pres- 
ent, the international community has been unable to formulate a universally agreed, 
legally binding, criminal law definition of terrorism. Common definitions of terror- 
ism refer only to those violent acts which are intended to create fear (terror), are 
perpetrated for an ideological goal, and deliberately target or disregard the safety of 
noncombatants (civilians) (http://en.wikipedia.org/wiki/Terrorism). Rooted in polit- 
ical science, terrorism study has attracted researchers from many social science dis- 
ciplines, from international relations to communications, and from defense analysis 
to intelligence study. Bruce Hoffman of the Georgetown University School of 
Foreign Service and Brian Jenkins of the RAND Corporation are among the prominent 
scholars in terrorism and counterinsurgency study. 

More recently, several terrorism scholars have begun to look into modern terrorism 
from a more data- and network-centric perspective. Marc Sageman’s two critically 
acclaimed books, Understanding Terror Networks (2004) and Leaderless Jihad (2008), 
are good examples. As stated on the book cover of Understanding Terror Networks: 

For decades, a new type of terrorism has been quietly gathering ranks in the world. American 
ability to remain oblivious to the new movements ended on September 11, 2001. The Islamic 
fanatics in the global Salafi jihad (the violent, revivalist social movement of which al Qaeda 
is a part) target the West, but their operations mercilessly slaughter thousands of people of all 
races and religions throughout the world. Marc Sageman challenges conventional wisdom 
about terrorism, observing that the key to mounting an effective defense against future attacks 
is a thorough understanding of the networks that allow the new terrorist to proliferate. 

(Sageman 2004). 

Based on intensive data collection and analysis of documents from international 
press and court hearings on 172 important jihadists, Sageman was able to look into 
the social bonds that predate ideological commitment. Many important network-based 
observations about these groups were identified with striking examples, including 
small-world network and clique formation, network robustness, wide geographical 
distribution, fuzzy boundaries, and the strength of weak bonds. The Internet has 
also been found to affect the global jihad by making possible a new type of 
relationship between an individual and a virtual community (Sageman 2004). In his 
2008 book “Leaderless Jihad,” Sageman continued his rigorous and systematic 
analysis of his detailed, evidence-based database of terrorists (500+ members). He 
wrote that “The process of radicalization that generates small, local, 
self-organized groups in a hostile habitat but linked through the Internet leads to 
a disconnected global network, the leaderless jihad” (Sageman 2008). He urged 
terrorism study to go from anecdote to data and from journalism to social sciences. 
Among his findings, face-to-face interactions were more common before 2004, when 
the average jihadi member was 26 years old; after 2004, most interactions were on 
the Internet, and the average member was 20 years of age. In a matter of 3-4 years, 
and consistent with the evolution of modern information and communication 
technology (ICT), the Internet and social media are helping the radical Islamists 
create a global, virtual social movement. 

In addition to Sageman’s influential work, several scholars have also examined 
the impact of the Internet on the proliferation and radicalization of the global 
jihadi movement. In his seminal 2006 book “Terror on the Internet,” Gabriel Weimann 
(2006) of the Haifa University Department of Communications in Israel reported 
his 8-year study of Internet use by terrorist organizations and their supporters. 
Sophisticated web sites have been found to help these organizations raise funds, 
recruit members, plan and launch attacks, and publicize their chilling results. 
Weimann describes the Internet as the new medium for promoting new terrorism to 
a new generation of audience and for winning the war over minds. In his 2007 book 
“Countering Militant Islamist Radicalization on the Internet,” Johnny Ryan 
presented the EU’s perspective on such developments, especially for the 
Europe-based homegrown radical groups. Due to the ubiquity and scale of the 
Internet, he suggested that EU member states pool technical and linguistic 
resources to monitor extremism on the Internet. He also suggested disseminating 
moderate opinions of credible Muslim scholars and web opinion leaders and 
encouraged user-driven content in social media to counter radicalization and 
violence. 

As summarized in Chap. 1, the Dark Web research program of the University of 
Arizona Artificial Intelligence Lab is a complementary effort in the emerging disci- 
pline of terrorism informatics (Chen 2006). Unlike the social sciences approach 
adopted by the abovementioned scholars, the Dark Web project adopts data, sys- 
tem, and computational approaches to studying terror and terrorism on the Internet. 
By collecting and analyzing a large-scale, longitudinal, and fluid collection of 
terrorist-generated content using computer programs, we offer our complementary 
perspective and approach in understanding the overwhelmingly complex interna- 
tional terrorism landscape. Hsinchun Chen’s 2006 book “Intelligence and Security 
Informatics” reports selected examples from the Dark Web project at its early stage. 
This book serves to report significant findings and observations from the recent 
Dark Web developments. 

The edited volume of “Terrorism Informatics” (2008) by Chen, Reid, Sinai, 
Silke, and Ganor became the first manuscript dedicated to the terrorism informatics 
topic. The book is highly interdisciplinary, with editors and contributors from both 
social sciences and computational science. The goal of the book is to present terror- 
ism informatics along two highly intertwined dimensions: methodological issues in 
terrorism research, including information diffusion techniques to support terrorism 
prevention, detection, and response; and legal, social, privacy, and data confidential- 
ity challenges and approaches. 



3 Terrorism Research Centers and Resources 

Terrorism informatics relies heavily on terrorism domain knowledge and databases. 
We summarize some critical terrorism research centers and resources below based 
on our Dark Web experience. They are grouped into three major categories: (1) 
think tanks and intelligence resources, (2) terrorism databases and online resources, 
and (3) higher education research institutes. Clearly, the summary is not intended to 
be exhaustive. We only hope to help (terrorism) community outsiders (like our- 
selves or other computer scientists) get a glimpse of possible terrorism-related 
resources to familiarize themselves with this complex area. For each research unit, 
we also provide a web link for readers to find additional detailed information. 



3.1 Think Tanks and Intelligence Resources 

The RAND Corporation is a US nonprofit research and development outfit. It 
has evolved from a think tank during World War II to an independent corporation. 
Two prominent terrorism scholars, Brian Jenkins and Bruce Hoffman, are affiliated 
with the RAND Corporation. The company has been influential with its many 
excellent, timely, and thorough reports on international relations, political 
violence, and terrorism studies. For examples, see: 
http://www.rand.org/media/experts/policy_areas/homeland_security_and_terrorism/index.html 
(Fig. 3.1). 

Fig. 3.1 Sample RAND report (screenshot of the RAND document page for “Building an 
Army of Believers: Jihadist Radicalization and Recruitment,” by Brian Michael 
Jenkins; testimony presented before the House Homeland Security Committee, 
Subcommittee on Intelligence, Information Sharing and Terrorism Risk Assessment, 
on April 5, 2007) 

As part of the West Point military academy, the Combating Terrorism Center 
(CTC) has provided counterterrorism strategic analyses independent from academy 
curricula, Pentagon tactics, or US government politics since 2003. Among its 
notable topical reports are the Harmony project (making sense of the DoD al-Qaeda 
document database) and the Islamic imagery project. For more detail, see: 
http://ctc.usma.edu/harmony/harmony_docs.asp (Fig. 3.2). 

The International Institute for Counter-Terrorism (ICT), located in Herzliya, 
Israel, provides coverage of Middle Eastern events from an Israeli perspective. The 
institute regularly produces reports, commentaries, and multimedia content. Its 
highly successful annual conference typically draws 400-800 participants from all 
over the world. For more information, see: http://www.ict.org.il/ (Fig. 3.3). 



3.2 Terrorism Databases and Online Resources 

Internet Haganah (Hebrew word for “defense”) is heavily involved in monitoring 
and disabling terror web sites on the web. It alerts counterterrorism vigilantes and 
indirectly fosters information “warfare.” Its Open Source Intelligence (OSINT) gath- 
ers and catalogs available intelligence documents of relevance to terrorism. For more 
information, see: http://internet-haganah.com/haganah/internet.html (Fig. 3.4). 

Fig. 3.2 The CTC harmony reports (screenshot showing “Cracks in the Foundation: 
Leadership Schisms in al-Qa'ida from 1989-2006” and “Al-Qa'ida's Foreign Fighters 
in Iraq: A First Look at the Sinjar Records”) 

Fig. 3.3 Sample ICT multimedia content 



Middle East Media Research Institute (MEMRI) is a nonprofit Washington, 
D.C.-based organization, with branches in Europe, Japan, and Israel. It regularly 
translates Arabic, Persian, and Turkish media and annotates videos, news articles, 
and web sites in the region. It also provides Islamic reformers a platform by 
translating their ideas and thoughts. For more information, see: 
http://memri.org/index.html (Fig. 3.5). 

The Memorial Institute for the Prevention of Terrorism (MIPT) is a nonprofit 
organization funded by the US Department of Homeland Security (DHS). It is based 
in Oklahoma City, Oklahoma, where the 1995 federal building bombing spurred 
interest in a terrorism information repository, which resulted in the creation of 
the MIPT Terrorism Knowledge Base (TKB). The TKB contains two separate terrorist 
incident databases, the RAND Terrorism Chronology (1968-1997) and the RAND-MIPT 
Terrorism Incident database (1998-Present). The TKB ceased operations on March 31, 
2008. For more information, see: http://www.mipt.org/ (Fig. 3.6). 

Fig. 3.4 Sample Internet Haganah information on the web (screenshot headline: 
“Object lesson in Information Warfare: Al-Manar TV on THAICOM? Not any more!”) 

Fig. 3.5 Sample MEMRI web content 



Fig. 3.6 MIPT and its terrorism knowledge base (screenshot of the TKB group 
profile for the Abu Sayyaf Group (ASG)) 

Also funded by the DHS, the University of Maryland’s National Consortium for 
the Study of Terrorism and Responses to Terrorism (START) conducts extensive 
research on terrorist group formation and recruitment, persistence and dynamics, 
and societal responses to terrorist threats and attacks. It provides a searchable 
open source Global Terrorism Database (GTD), presenting information on terrorist 
events around the world since 1970 (currently updated through 2007), including 
data on where, when, and how each of over 80,000 terrorist events occurred. For 
more information, see: http://www.start.umd.edu/start/ (Fig. 3.7). 

Fig. 3.7 START and its global terrorism database 

Originally founded to document the crimes of the Holocaust and find criminals 
responsible for the genocide, the Simon Wiesenthal Center has a targeted tolerance 
education mission, which it carries out through the Snider Social Action Institute 
and its Museum of Tolerance. It also produces the Digital Terrorism DVDs with 
extremist web site snapshots as part of its larger educational mission. For more 
information, see: http://www.wiesenthal.com (Fig. 3.8). 

Fig. 3.8 Simon Wiesenthal Center and its digital terrorism DVDs (screenshot of the 
product page for “Digital Terrorism & Hate 2007,” described as the Center's ninth 
annual interactive report, culled from close to 7,000 problematic websites, blogs, 
newsgroups, and on-demand video sites) 

SITE (Search for International Terrorist Entities) was founded in 2002 by 
undercover activist Rita Katz. The SITE Intelligence Group, a for-profit 
organization, now monitors terrorist activities. It keeps translations of 
terrorist media and documents and makes them available via subscription to media, 
governments, and corporations. For more information, see: 
http://www.siteintelgroup.org/ (Fig. 3.9). 

Fig. 3.9 Sample SITE report (screenshot: “First Issue of ‘Echo of the Epics’ - A 
Magazine from Al-Qaeda in Yemen,” by SITE Intelligence Group, January 14, 2008) 

Fig. 3.10 Sample CSTPV reports (screenshot listing the latest additions to the 
CSTPV e-library from open sources) 



3.3 Higher Education Research Institutes 

The University of St. Andrews Centre for the Study of Terrorism and Political 
Violence (CSTPV) provides a political science perspective on terrorism. The 
program offers subscription-based access to political analyses and awards distance 
learning terrorism study certificates. Many influential political science-grounded 
terrorism scholars were trained at St. Andrews. For more information, see: 
http://www.st-andrews.ac.uk/~wwwir/research/cstpv/ (Fig. 3.10). 

The International Centre for Political Violence and Terrorism Research 
(ICPVTR) is a research and education center within the S. Rajaratnam School of 
International Studies (RSIS) at Nanyang Technological University, Singapore. 
ICPVTR conducts research, training, and outreach programs aimed at reducing the 
threat of politically motivated violence and at mitigating its effects on the interna- 
tional system. Its Global Pathfinder System is a one-stop repository for information 
on the current and emerging terrorist threat. The database focuses on terrorism and 
political violence in the Asia-Pacific region, comprising Southeast Asia, North 
Asia, South Asia, Central Asia, and Oceania. For more information, see: 
http://www.pvtr.org/ (Fig. 3.11). 

Sponsored by many US government agencies, the Dartmouth Institute for 
Security Technology Studies (ISTS) focuses on cyber security, trust, and cyberter- 
rorism. Its project topics include hardening IT infrastructure against attack, image 
and video forensics, trusted digital certificate provision, information infrastructure 
risk assessment, etc. For more information, see: 
http://www.ists.dartmouth.edu/library.php (Fig. 3.12). 

Fig. 3.11 ICPVTR terrorism database 

Fig. 3.12 The Dartmouth ISTS web site 



4 Conclusions 

As the field of terrorism informatics continues to grow and evolve, we anticipate 
broader collaboration between social scientists and computational researchers who 
are interested in counterterrorism research. We also expect new methodologies, ter- 
rorism databases, and computational approaches to emerge and mature based on the 
rich content and complex interactions produced by the terrorists and extremists on 
the Internet. 



Part II 

Dark Web Research: Computational 
Approach and Techniques 



Chapter 4 

Forum Spidering 



1 Introduction 

The Internet acts as an ideal method for information and propaganda dissemination 
(Whine 1997; Gustavson and Sherkat 2004). Computer-mediated communication 
offers a quick, inexpensive, and anonymous means of communication for extremist 
groups (Crilley 2001). Extremist groups frequently use the web to promote hatred 
and violence (Glaser et al. 2002). This problematic facet of the Internet is often 
referred to as the Dark Web (Chen 2006). An important component of the Dark Web 
is extremist forums hidden deep within the Internet. Many have stated the need for 
collection and analysis of Dark Web forums (Burris et al. 2000; Schafer 2002). Dark 
Web materials have important implications for intelligence and security informat- 
ics-related applications (Chen 2006). The collection of such content is also impor- 
tant for studying and understanding the diverse social and political views present in 
these online communities. 

The unprecedented growth of the Internet has resulted in considerable focus on 
web crawling/spidering techniques in recent years. Crawlers are defined as “soft- 
ware programs that traverse the World Wide Web information space by following 
hypertext links and retrieving web documents by standard HTTP protocol” (Cheong 
1996). They are programs that can create a local collection or index of large vol- 
umes of web pages (Cho and Garcia-Molina 2000). Crawlers can be used for gen- 
eral-purpose search engines or for domain-specific collection building. The latter 
are referred to as focused or topic-driven crawlers (Chakrabarti et al. 1999; Pant 
et al. 2002). 

There is a need for a focused crawler that can collect Dark Web forums. Most
previous focused crawlers have concentrated on collecting static English web pages
from the “surface web.” A Dark Web forum-focused crawler faces several design
challenges. One major concern is accessibility. Web forums are dynamic and
often require memberships. They are part of the “hidden web” (Florescu et al.
1998; Raghavan and Garcia-Molina 2001), which is not easily accessible through
normal web navigation or standard crawling. There are also multilingual web
mining considerations. More than 30% of the web is in non-English languages
(Chen and Chau 2003). Consequently, the Dark Web also encompasses numerous
languages. Another important concern is content richness. Dark Web forums
contain rich content used for routine communication and propaganda dissemination
(Abbasi and Chen 2005; Zhou et al. 2005; Qin et al. 2005). These forums contain
static and dynamic text files, archive files, and various forms of multimedia (e.g.,
images, audio, and video files). Collection of such diverse content types
introduces many unique challenges not encountered with standard spidering of
indexable (text-based) files.

H. Chen, Dark Web: Exploring and Data Mining the Dark Side of the Web, 45

In this chapter, we propose the development of a focused crawler that can collect
Dark Web forums. Our spidering system uses breadth- and depth-first (BFS and
DFS) traversal based on URL tokens, anchor text, and link levels for crawl space
URL ordering. We also utilize incremental crawling for collection updating, using
wrappers to identify updated content. The system also includes design elements
intended to overcome the previously mentioned accessibility, multilingual, and
content richness challenges. For accessibility, we use a human-assisted approach
(Raghavan and Garcia-Molina 2001) for attaining Dark Web forum membership.
Our system also includes tailored spidering parameters and proxies for each forum
in order to improve accessibility. The crawler uses language-independent features
for crawl space URL ordering in order to negate any complications attributable to the
presence of numerous languages. We also incorporate iterative collection of
incomplete downloads and relevance feedback for improved multimedia collection.

The remainder of this chapter is organized as follows: Section 4.2 presents a 
review of related work on focused and hidden web crawling. Section 4.3 describes 
research gaps and our related research questions. Section 4.4 describes a research 
design geared toward addressing those questions. Section 4.5 presents a detailed 
description of our Dark Web forum spidering system. Section 4.6 describes experi- 
mental results evaluating the efficacy of our human-assisted approach for gaining 
access to Dark Web forums as well as the incremental update procedure that uses 
recall improvement. This section also highlights the Dark Web forum collection 
statistics for data gathered using the proposed system. Section 4.7 contains conclud- 
ing remarks. 



2 Related Work: Focused and Hidden Web Crawlers 

Focused crawlers “seek, acquire, index, and maintain pages on a specific set of top- 
ics that represent a narrow segment of the web” (Chakrabarti et al. 1999). The need 
to collect high-quality, domain-specific content results in several important charac- 
teristics for such crawlers that are also relevant to collection of Dark Web forums. 
Some of these characteristics are specific to focused and/or hidden web crawling 
while others are relevant to all types of spiders. We review previous research
pertaining to these important considerations, which include accessibility, collection
type and content richness, URL ordering features and techniques, and collection
update procedures.



2.1 Accessibility

Most search engines cover what is referred to as the “publicly indexable web” 
(Lawrence and Giles 1998; Raghavan and Garcia-Molina 2001). This is the part of 
the web easily accessible with traditional web crawlers (Sizov et al. 2003). As noted 
by Lawrence and Giles (1998), a large portion of the Internet is dynamically gener- 
ated. Such content typically requires users to have prior authorization, fill out forms, 
or register (Raghavan and Garcia-Molina 2001). This covert side of the Internet is 
commonly referred to as the hidden/deep/invisible web. Hidden web content is often 
stored in specialized databases (Lin and Chen 2002). For example, the IMDB movie 
review database contains a plethora of useful information regarding movies, yet 
standard crawlers cannot access this information (Sizov et al. 2003). A study con- 
ducted in 2000 found that the invisible web contained 400-550 times the informa- 
tion present in the traditional surface web (Bergman 2000; Lin and Chen 2002). 

Two general strategies have been introduced to access the hidden web via auto- 
mated web crawlers. The first approach entails use of automated form-filling tech- 
niques. Several different automated query generation approaches for querying such 
“hidden web” databases and fetching the dynamically generated content have been 
proposed (e.g., Barbosa and Freire 2004; Ntoulas et al. 2005). Other techniques 
keep an index of hidden web search engines and redirect user queries to them (Lin 
and Chen 2002) without actually indexing the hidden databases. However, many 
automated approaches ignore/exclude collection or querying of pages requiring
log-in (e.g., Lage et al. 2002). Thus, automated form-filling techniques seem
problematic for Dark Web forums, where log-in is often required.

A second alternative for accessing the hidden web is a task-specific,
human-assisted approach (Raghavan and Garcia-Molina 2001). This approach provides a
semiautomated framework that allows human experts to assist the crawler in gain- 
ing access to hidden content. The amount of human involvement is dependent on the 
complexity of the accessibility issues faced. For example, many simple forms ask- 
ing for name, e-mail address, etc., can be automated with standardized responses. 
Other more complex questions require greater expert involvement. Such an approach 
seems more suitable for the Dark Web, where the complexity of the access process 
can vary significantly. 



2.2 Collection Type 

Previous focused crawling research has been geared toward collecting web sites, 
blogs, and web forums. There has been considerable research on collection of stan- 
dard web sites and pages relating to a particular topic, often for portal building. 
Srinivasan et al. (2002) and Chau and Chen (2003) fetched biomedical content 
from the web. Sizov et al. (2003) collected web pages pertaining to handicrafts and 
movies. Pant et al. (2002) evaluated their topic crawler on various keyword queries 
(e.g., “recreation”). 

There has also been work on collecting weblogs. BlogPulse (Glance et al. 2004)
is a blog analysis portal. The site contains analysis of key discussion topics/trends 
for roughly 100,000 spidered weblogs. Such blogs can also be useful for marketing 
intelligence (Glance et al. 2005). Blogs containing product reviews analyzed using 
sentiment analysis techniques can provide insight into how people feel about vari- 
ous products. 

Web forum crawling presents a unique set of difficulties. Discovering web forums 
is challenging due to the lack of a centralized index (Glance et al. 2005). Furthermore, 
web forums require information extraction wrappers for derivation of metadata 
(e.g., authors, messages, time stamps, etc.). Wrappers are important for data analy- 
sis and incremental crawling (respidering only those threads containing newly 
posted messages). Incremental crawling is discussed in greater detail in the 
“Collection Update” section. There has been limited research on web forum spider- 
ing. BoardPulse (Glance et al. 2005) is a system for harvesting messages from 
online forums. It has two components: a crawler and a wrapper. Limanto et al. 
(2005) developed a web forum information extraction engine that includes a crawler, 
wrapper generator, and extractor (i.e., application of generated wrapper). Yih et al. 
(2004) created an online forum mining system composed of a crawler and informa- 
tion extractor for mining deal forums. There has been no prior research on collect- 
ing Dark Web forums. 



2.3 Content Richness 

The web is rich in indexable and multimedia files. Indexable files include static text 
files (e.g., HTML, Word, and PDF documents) and dynamic text files (e.g., .asp,
.jsp, and .php). Multimedia files include images, animation, audio, and video files.
Difficulties in indexing make multimedia content hard to collect accurately
(Baeza-Yates 2003). Multimedia file sizes are typically significantly larger than
indexable files, resulting in longer download time and frequent time-outs. Heydon 
and Najork (1999) fetched all MIME file types (including image, video, audio, and 
.exe files) using their Mercator crawler. They noted that collecting such files 
increased the overall spidering time and doubled the average file size as compared 
to just fetching HTML files. Consequently, many previous studies have ignored 
multimedia content altogether (e.g., Pant et al. 2002). 



2.4 URL Ordering Features 

Aggarwal et al. (2001) pointed out four categories of features for crawl space URL 
ordering. These include links, URL and/or anchor text, page text, and page levels. 
Link-based features have been used considerably in previous research. Many studies
have used in-/back-links and out-links (Cho et al. 1998; Pant et al. 2002). Sibling links
(Aggarwal et al. 2001) consider sibling pages (ones with shared parent in-link).
Context graphs (Diligenti et al. 2000) derive back-links for each seed URL and use 
these to construct a multilayer context graph. Such graphs can be used to extract paths 
leading up to relevant nodes (target URLs). Focused/topical crawlers often use bag- 
of-words (BOW) found in the web page text (Aggarwal et al. 2001; Pant et al. 2002). 
For instance, Srinivasan et al. (2002) used BOW for biomedical text categorization in 
their focused crawler. While page text features are certainly very effective, they are 
also language dependent and can be harder to apply in situations where the collection 
is composed of pages in numerous languages. Other studies have also used URL/ 
anchor text. Word tokens found within the URL anchor have been used effectively to 
help control the crawl space (Cho et al. 1998; Ester et al. 2001). URL tokens have also 
been incorporated in previous focused crawling research (Aggarwal et al. 2001; Ester 
et al. 2001). Another important category of features for URL ordering is page levels. 
Diligenti et al. (2000) trained text classifiers to categorize web pages at various levels 
away from the target. They used this information to build path models that allowed 
consideration of irrelevant pages as part of the path to attain target pages. A potential 
path model may consider pages one or two levels away from a target, known as tun- 
neling (Ester et al. 2001). Ester et al. (2001) used the number of slashes or levels 
from the domain as an indicator of URL importance. They argued that pages closer to 
the main page are likely to be of greater importance. 
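
As an illustration of two of the language-independent features reviewed above, the following minimal sketch extracts URL tokens and a slash-based page level from a URL. The function names and the example URL are hypothetical, and the tokenization rule (splitting on non-alphanumeric characters) is an assumption rather than the method of any cited study.

```python
import re
from typing import List
from urllib.parse import urlparse


def url_tokens(url: str) -> List[str]:
    """Split a URL into lowercase alphanumeric tokens (assumed tokenization)."""
    return [tok for tok in re.split(r"[^a-z0-9]+", url.lower()) if tok]


def url_level(url: str) -> int:
    """Count non-empty path segments, i.e., levels below the domain's main page."""
    path = urlparse(url).path
    return len([segment for segment in path.split("/") if segment])


# Hypothetical forum URL used only for demonstration.
example = "http://forum.example.org/vb/showthread.php?t=21476"
print(url_tokens(example))
print(url_level(example))
```

Neither feature depends on the language of the page text, which is why features of this kind remain usable for multilingual collections.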



2.5 URL Ordering Techniques 

Previous research has typically used breadth-, depth-, and best-first search for URL 
ordering. Depth-first search (DFS) has been used in crawling systems such as Fish 
Search (De Bra and Post 1994). Breadth-first search (BFS) (Cho et al. 1998; Ester 
et al. 2001; Najork and Wiener 2001) is one of the simplest strategies. It has worked 
fairly well in comparison with more sophisticated best-first search strategies (Cho 
et al. 1998; Najork and Wiener 2001). However, BFS is typically not employed by 
focused crawlers that are concerned with identifying topic-specific web pages using 
the aforementioned URL ordering features. 

Best-first uses some criterion for ranking URLs in the crawl space, such as link 
analysis or text analysis. Numerous link analysis techniques have been used for 
URL ordering. Cho et al. (1998) evaluated the effectiveness of PageRank and
back-link counts. Pant et al. (2002) also used PageRank. Aggarwal et al. (2001) used the
number of relevant siblings. They considered pages with a higher percentage of 
relevant siblings more likely to also be relevant. Sizov et al. (2003) used the HITS 
algorithm to compute authority scores, while Chakrabarti et al. (1999) used a modi- 
fied HITS. Chau and Chen (2003) used a Hopfield Net crawler that collected pages 
related to the medical domain based on link weights. 
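
The best-first idea can be sketched with a priority queue, as below. This is a minimal illustration and not a reconstruction of any cited crawler: the link graph is an in-memory dictionary instead of live pages, and the `score` function is a placeholder for whatever ranking criterion is chosen (for example, a back-link count).

```python
import heapq
from typing import Callable, Dict, List


def best_first_crawl(
    seeds: List[str],
    out_links: Dict[str, List[str]],
    score: Callable[[str], float],
    max_pages: int,
) -> List[str]:
    """Visit URLs in descending score order, expanding out-links as we go."""
    # heapq is a min-heap, so scores are negated to pop the best URL first.
    frontier = [(-score(url), url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited: List[str] = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for nxt in out_links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-score(nxt), nxt))
    return visited


# Toy link graph and scores; all names and values are hypothetical.
toy_links = {"seed": ["a", "b"], "a": ["c"], "b": ["d"]}
toy_scores = {"seed": 0, "a": 1, "b": 5, "c": 9, "d": 2}
print(best_first_crawl(["seed"], toy_links, lambda u: toy_scores.get(u, 0), 10))
```

Replacing the priority queue with a first-in, first-out queue would turn the same loop into breadth-first search.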

Text analysis methods include similarity scoring approaches and machine learn- 
ing algorithms. Aggarwal et al. (2001) used similarity equations with page content 
and URL tokens. Others have used the vector space model and cosine similarity 
measure (Pant et al. 2002; Srinivasan et al. 2002). Sizov et al. (2003) used support
vector machines (SVM) with BOW for document classification. Srinivasan et al. 
(2002) used BOW and link structures with a neural net for ordering URLs based on 
the prevalence of biomedical content. Chen et al. (1998a; 1998b) used a genetic 
algorithm to order the URL crawl space for the collection of topic-specific web 
pages based on bag-of-word representations of pages. 
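
For readers unfamiliar with the vector space model, a minimal sketch of cosine similarity over bag-of-words term counts follows. Whitespace tokenization and raw term counts are simplifying assumptions; the cited studies may have used other weighting schemes.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between two bag-of-words term-count vectors."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


print(cosine_similarity("forum thread message", "forum message board"))
```

Because the score depends on shared word tokens, it inherits the language dependence of page text features noted earlier.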



2.6 Collection Update Procedure 

Two approaches for collection updating are periodic and incremental crawling (Cho 
and Garcia-Molina 2001). Periodic crawling entails building of a brand-new collec- 
tion for updating. This is commonly done since it is often easier than figuring out 
which pages to refresh. Periodic crawling is inefficient from a spidering perspective 
(more time consuming). However, multiple versions of a collection may improve 
overall recall. Incremental crawling gathers new and updated content. In the case of 
web sites, this often requires some form of change frequency estimation (Cho and 
Garcia-Molina 2003) in order to determine which pages need to be updated. For 
web forums, this entails fetching only those threads that have been updated (Yih 
et al. 2003; Glance et al. 2005) since we only want to fetch newly posted messages. 
This requires the use of a wrapper that can parse out the “last updated” dates for 
threads and compare them against the previous collection to determine which pages 
need to be collected. 
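
The comparison step for web forums can be sketched as follows, assuming a wrapper has already produced a mapping from thread URL to "last updated" date for both the previous collection and the current board pages. The function name and the sample data are hypothetical.

```python
from datetime import datetime
from typing import Dict, List


def threads_to_respider(
    previous: Dict[str, datetime], current: Dict[str, datetime]
) -> List[str]:
    """Return threads that are new or whose 'last updated' date has advanced."""
    return sorted(
        url
        for url, updated in current.items()
        if url not in previous or updated > previous[url]
    )


# Hypothetical thread identifiers and dates.
previous = {"t=1": datetime(2006, 11, 1), "t=2": datetime(2006, 11, 5)}
current = {
    "t=1": datetime(2006, 11, 1),  # unchanged, skipped
    "t=2": datetime(2006, 11, 8),  # updated, fetched again
    "t=3": datetime(2006, 11, 7),  # new thread, fetched
}
print(threads_to_respider(previous, current))
```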



2.7 Summary of Previous Research 

Table 4.1 provides a summary of selected previous research on focused crawling. 

The majority of studies have focused on collection of indexable files from the 
surface web. There have only been a few studies that performed focused crawling 
on the hidden web. Similarly, only a few studies have collected content from web 
forums. Most previous research on focused crawling has used bag-of-word (BOW), 
link, or URL token features coupled with a best-first search strategy for crawl space 
URL ordering. Furthermore, most prior research ignored the multilingual dimen- 
sion, only collecting content in a single language (usually English). Collection of 
Dark Web forums entails retrieving rich content (including indexable and multime- 
dia files) from the hidden web in multiple languages. Dark Web forum crawling is 
therefore at the cross section of several important areas of crawling research, many 
of which have received limited attention in prior research. The following section 
summarizes these important research gaps and provides a set of related research 
questions which are addressed in the remainder of this chapter. 

3 Research Gaps and Questions

Based on our review of previous literature, we have identified several important 
research gaps. 



3.1 Focused Crawling of the Hidden Web 

There has been limited focused crawling work on the hidden web. Most focused 
crawler studies developed crawlers for the surface web (Raghavan and Garcia- 
Molina 2001). Prior hidden web research mostly focused on automated form filling 
or query redirection to hidden databases, that is, accessibility issues. There has been 
little emphasis on building topic-specific web page collections from these hidden 
sources. We are not aware of any attempts to automatically collect Dark Web con- 
tent pertaining to hate and extremist groups. 



3.2 Content Richness 

Most previous research has focused on indexable (text-based) files. Large multime- 
dia files (e.g., videos) can be hundreds of megabytes. This can cause connection 
time-outs or excessive server loads, resulting in partial/incomplete downloads. 
Furthermore, the challenges in indexing multimedia files pose problems. It is
difficult to assess the quality of collected multimedia items. As Baeza-Yates (2003)
noted, automated multimedia indexing is more of an image retrieval challenge than
an information retrieval problem. Nevertheless, given the content richness of the
Internet in general and the Dark Web in particular (Chen 2006), there is a need to
capture multimedia files.



3.3 Web Forum Collection Update Strategies 

There has been considerable research on evaluating various collection update strate- 
gies for web sites (e.g., Cho and Garcia-Molina 2000). However, there has been 
little work done on comparing the effectiveness of periodic versus incremental 
crawling for web forums. Most web forum research has assumed an incremental 
approach. Given the accessibility concerns associated with Dark Web forums, peri- 
odic and incremental approaches both provide varying benefits. Periodic crawlers 
can improve collection recall by allowing multiple attempts at capturing previously 
uncollected pages. This may be less of a concern for surface web forums but is 
important for the Dark Web. In contrast, incremental crawlers can improve collec- 
tion efficiency and reduce redundancy. There is a need to evaluate the effectiveness 
of periodic and incremental crawling applied to Dark Web forums. 

3.4 Research Questions

Based on the gaps described, we propose the following research questions: 

1. How effectively can Dark Web forums be identified and accessed for collection
purposes? 

2. How effectively can Dark Web content (indexable and multimedia) be 
collected? 

3. Which collection update procedure (periodic or incremental) is more suitable for 
Dark Web forums? How can recall improvement further enhance the update 
process? 

4. How can analysis of extracted information from Dark Web forums improve our 
understanding of these online communities? 



4 Research Design 

4.1 Proposed Dark Web Forum Crawling System 

In this chapter, we propose a Dark Web forum spidering system. Our proposed sys- 
tem consists of an accessibility component that uses a human-assisted registration 
approach to gain access to Dark Web forums. Our system also utilizes multiple 
dynamic proxies and forum-specific spidering parameter settings to maintain forum 
access. 

Our URL ordering component uses language-independent URL ordering features
to allow spidering of Dark Web forums across languages. We plan to focus on
groups from three regions: US domestic, Middle East, and Latin America/Spain.
Additionally, a rule-based URL ordering technique coupled with BFS and DFS 
crawl space traversal is utilized. Such a technique is employed in order to minimize 
the amount of irrelevant web pages collected. 

We also propose the use of an incremental crawler that uses forum wrappers to 
determine the subset of threads that need to be collected. Our system will include a 
recall improvement procedure that parses the spidering log and reinserts incomplete 
downloads into the crawl space. Finally, the system features a collection analyzer 
that checks multimedia files for duplicate downloads and generates collection sta- 
tistics at the forum, region, and overall collection levels. 
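
The chapter does not specify how the collection analyzer detects duplicate multimedia downloads. One common approach, shown here purely as an assumption, is to group files by a cryptographic hash of their content, so that identical files stored under different names are flagged. The file names and contents below are hypothetical.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def find_duplicates(files: Dict[str, bytes]) -> List[List[str]]:
    """Group file names whose contents are byte-for-byte identical."""
    by_digest: Dict[str, List[str]] = defaultdict(list)
    for name, content in files.items():
        by_digest[hashlib.sha256(content).hexdigest()].append(name)
    # Only groups with more than one member are duplicates.
    return sorted(sorted(names) for names in by_digest.values() if len(names) > 1)


sample_files = {
    "forum1/clip.wmv": b"video-bytes",
    "forum2/copy_of_clip.wmv": b"video-bytes",
    "forum1/banner.jpg": b"image-bytes",
}
print(find_duplicates(sample_files))
```

Content hashing only catches exact copies; re-encoded or resized versions of the same media would need a different technique.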



4.2 Accessibility 

As noted by Raghavan and Garcia-Molina (2001), the most important evaluation
criterion for hidden web crawling is how effectively the content was accessed. They
developed an accessibility metric as follows: databases accessed/total attempted.
We intend to evaluate the effectiveness of the task-specific, human-assisted approach
in comparison with not using such a mechanism. Specifically, we would like to
evaluate our system’s ability to access Dark Web forums. This translates into
measuring the percentage of attempted forums accessed.
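
Applied to forums, the metric is simply the number of forums accessed divided by the number attempted. A minimal sketch follows; the counts in the example call are illustrative placeholders, not results from this study.

```python
def accessibility(accessed: int, attempted: int) -> float:
    """Fraction of attempted forums (or databases) that were accessed."""
    if attempted <= 0:
        raise ValueError("attempted must be positive")
    if not 0 <= accessed <= attempted:
        raise ValueError("accessed must be between 0 and attempted")
    return accessed / attempted


# Illustrative numbers only.
print(f"{accessibility(3, 4):.0%}")
```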



4.3 Incremental Crawling for Collection Updating 

We plan to evaluate the effectiveness of our proposed incremental crawler in com- 
parison with periodic crawling. The incremental crawler will obviously be more 
efficient in terms of spidering time and data redundancy. However, a periodic crawl- 
ing approach gets multiple attempts to collect each page, which can improve overall 
collection recall. Evaluation of both approaches is intended to provide additional 
insight into which collection update technique is more suitable for Dark Web forum 
spidering. 



5 System Design 

Based on our research design, we implemented a focused crawler for Dark Web 
forums. 

Our system consists of four major components (shown in Fig. 4.1): 

• Forum identification: to identify the list of extremist forums to spider.

• Forum preprocessing: includes accessibility and crawl space traversal issues as
well as forum wrapper generation.

• Forum spidering: consists of an incremental crawler and recall improvement
mechanism.

• Forum storage and analysis: to store and analyze the forum collection.



5.1 Forum Identification 

The forum identification phase has three components: 

Step 1: Identify extremist groups 

Sources for the US domestic extremist groups include the Anti-Defamation League,
FBI, Southern Poverty Law Center, Militia Watchdog, and the Google Web Directory
(as a supplement). Sources for the international extremist groups include the United
States Committee for a Free Lebanon, Counter-Terrorism Committee of the U.N.
Security Council, US State Department report, Official Journal of the European
Union, as well as government reports from the United Kingdom, Australia, Japan,
and P. R. China. Due to regional and language constraints, we chose to focus on
groups from three areas: North America (English), Latin America (Spanish), and
the Middle East. These groups are all significant for their sociopolitical importance.
Furthermore, collection and analysis of Dark Web content from these three regions
can facilitate a better understanding of the relative social and cultural differences
between these groups. In addition to obvious linguistic differences, groups from
these regions also display different web design tendencies and usage behaviors
(Abbasi and Chen 2005), which provide a unique set of collection and analysis
challenges.

Fig. 4.1 Dark Web forum crawling system design

Step 2: Identify forums from extremist web sites 

We identify an initial set of extremist group URLs and then use link analysis for 
expansion purposes as shown in Fig. 4.2. 

The initial set of URLs is identified from three sources. Firstly, we use search 
engines coupled with a lexicon containing extremist organization name(s), leader(s)’ 
and key members’ names, slogans, and special keywords used by extremists. 
Secondly, we utilize government reports. Finally, we reference research centers. 
A link analysis approach is used to expand the initial list of URLs. We incorporate 
a back-link search using Google, which has been shown to be effective in prior 
research (Diligenti et al. 2000), and also search out-links. The identified web forums 
are manually checked. 









[Fig. 4.2 Flowchart in two stages. Stage 1, Identify Extremist Group URLs: a
terrorism lexicon (organization names, leader names, slogans, special keywords),
search engines (Google, Yahoo, etc.), government reports (FBI, US State
Department, UN Security Council, etc.), and research centers (ATC, MEMRI,
Dartmouth, Norwegian Research, etc.) yield the initial seed URLs of extremist
groups. Stage 2, Expand Extremist Group URLs by Link Analysis: back-link
extraction, out-link extraction, and manual filtering yield the expanded URLs of
extremist groups.]

Step 3: Identify forums hosted on major web sites

We also identify forums hosted by other web sites and public Internet service providers 
(ISPs) that are likely to be used by Dark Web groups, for example, MSN groups or 
AOL Groups. Public ISPs are searched with our Dark Web domain lexicon for a list 
of potential forums. 

The above three steps help identify a seed set of Dark Web forums. Once the 
forums have been identified, several important preprocessing issues must be 
resolved before spidering in order to develop proper features and techniques for 
managing the crawl space. These include accessibility concerns and identification 
of forum structure. 



5.2 Forum Preprocessing 

The forum preprocessing phase has three components: accessibility, structure, and 
wrapper generation. The accessibility component deals with acquiring and main- 
taining access to Dark Web forums. The structure component is designed to identify 
the forum URL mapping and devise the crawl space URL ordering using the rele- 
vant features and techniques. 



5.2.1 Forum Accessibility 

Step 1: Apply for membership 

Many Dark Web forums do not allow anonymous access (Zhou et al. 2006). In order
to access and collect information from those forums, one must create a user ID and
password, send an application request to the web master, and wait to get
permission/registration to access the forum. In certain forums, web masters are
very selective. It can take a couple of rounds of e-mails to get access privilege.
For such forums, human expertise is invaluable. Nevertheless, in some cases, access
cannot be attained.

Fig. 4.3 Proxies used for Dark Web forum crawling

Step 2: Identify appropriate spidering parameters 

Spidering parameters such as number of connections, download intervals, time-out, 
speed, etc., need to be set appropriately according to server and network limitations 
and the various forum blocking mechanisms. Dark Web forums are rich in terms of 
their content. Multimedia files are often fairly large in volume (particularly com- 
pared to indexable files). The spidering parameters should be able to handle down- 
load of larger files from slow servers. However, we may still be blocked based on 
our IP address. Therefore, we use proxies to increase not only our recall but also our 
anonymity. 

Step 3: Identify appropriate proxies 

We use three types of proxy servers, as shown in Fig. 4.3. Transparent proxy servers 
are those that provide anyone with your real IP address. Translucent proxy servers 
hide your IP address or modify it in some way to prevent the target server from 
knowing about it. However, they let anyone know that you are surfing through a 
proxy server. Opaque proxy servers (preferred) hide your IP address and do not 
even let anyone know that you are surfing through a proxy server. There are several 
criteria for proxy server selection, including the latency (the smaller the better), reli- 
ability (the higher the better), and bandwidth (the faster the better). We update our 
list of proxy servers periodically from various sources. 
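
The selection criteria above can be expressed as a simple ranking, sketched below. The `Proxy` record, its field names, and the order in which the criteria are applied (proxy type first, then reliability, latency, and bandwidth) are assumptions made for illustration; the chapter lists the criteria but not how they are weighted. The addresses come from a reserved documentation range.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Proxy:
    address: str
    kind: str              # "transparent", "translucent", or "opaque"
    latency_ms: float      # smaller is better
    reliability: float     # fraction of successful requests; higher is better
    bandwidth_kbps: float  # higher is better


# Opaque proxies are preferred because they hide both the IP address and proxy use.
KIND_RANK = {"opaque": 0, "translucent": 1, "transparent": 2}


def rank_proxies(proxies: List[Proxy]) -> List[Proxy]:
    """Order proxies from most to least preferred (assumed priority order)."""
    return sorted(
        proxies,
        key=lambda p: (KIND_RANK[p.kind], -p.reliability, p.latency_ms, -p.bandwidth_kbps),
    )


candidates = [
    Proxy("192.0.2.1:8080", "transparent", 20, 0.99, 900),
    Proxy("192.0.2.2:8080", "opaque", 120, 0.90, 500),
    Proxy("192.0.2.3:8080", "opaque", 80, 0.95, 400),
]
print([p.address for p in rank_proxies(candidates)])
```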

5.2.2 Forum Structure

Step 1: Identify site maps 

Forums typically have hierarchical structures with boards, threads, and messages 
(Yih et al. 2004; Glance et al. 2005). They also contain considerable additional 
information such as message posting interfaces, search, and calendar pages. We first 
identify the site map of the forum based on the forum software packages. Glance 
et al. (2005) noted that although there are only a handful of commonly used forum 
software packages, they are highly customizable. 

Step 2: URL ordering features 

Our spidering system uses two types of language-independent URL ordering fea- 
tures, URL tokens and page levels. With respect to URL tokens, for web forums, we 
are interested in URLs containing words such as “board,” “thread,” “message,” etc. 
(Glance et al. 2005). Additional relevant URL tokens include domain names of 
third-party file hosting web sites. These third parties often contain multimedia files. 
File extension tokens (e.g., “.jpg” and “.wmv”) are also important. URLs that
contain phrases such as “sort=voteavg” and “goto=next” are also found in relevant
pages. However, these are not unique to board, thread, and message pages; hence, 
such tokens are not considered significant. The set of relevant URL tokens differs 
based on the forum software being used. Such tokens are language independent yet 
software specific. 

Page levels are also important as evidenced by prior focused crawling research 
(Diligenti et al. 2000; Ester et al. 2001). URL level features are important for Dark 
Web forums due to the need to collect multimedia content. Multimedia files are 
often stored on third-party host sites that may be a few levels away from the source 
URL. In order to capture such content, we need to use a rule-based approach that 
allows the crawler to go a few additional levels. For example, if the URL or anchor 
text contains a token that is a multimedia file extension or the domain name for a 
common third-party file carrier, we want to allow the crawler to “tunnel” a few 
links. 
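
A rule of this kind might look like the following minimal sketch. The token lists, the hypothetical file-hosting domain, and both level limits are assumptions chosen for illustration; in practice they would be set per forum and per forum software package.

```python
FORUM_TOKENS = ("board", "thread", "message")
MEDIA_EXTENSIONS = (".jpg", ".wmv")
FILE_HOSTS = ("files.example.com",)  # hypothetical third-party file host
MAX_FORUM_LEVEL = 3      # assumed depth limit for ordinary forum pages
TUNNEL_EXTRA_LEVELS = 2  # assumed extra levels allowed when chasing multimedia


def should_follow(url: str, anchor_text: str, level: int) -> bool:
    """Decide whether a link at a given level from the source URL is crawled."""
    text = (url + " " + anchor_text).lower()
    is_media = any(ext in text for ext in MEDIA_EXTENSIONS) or any(
        host in text for host in FILE_HOSTS
    )
    if is_media:
        # Multimedia links may "tunnel" a few levels beyond the normal limit.
        return level <= MAX_FORUM_LEVEL + TUNNEL_EXTRA_LEVELS
    is_forum_page = any(token in text for token in FORUM_TOKENS)
    return is_forum_page and level <= MAX_FORUM_LEVEL


print(should_follow("http://forum.example.org/showthread.php?t=1", "", 2))
print(should_follow("http://files.example.com/clip.wmv", "video", 5))
print(should_follow("http://forum.example.org/calendar.php", "calendar", 1))
```

The rule relies only on URL and anchor tokens and on link levels, so it needs no knowledge of the language in which the forum is written.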

Step 3: URL ordering techniques 

As mentioned in the previous section, we use rules based on URL tokens and levels 
to control the crawl space. Moreover, to adapt to different forum structures, we need 
to use different crawl space traversal strategies. 

Breadth-first (BFS) is used for board page forums while depth-first (DFS) is used 
for Internet service provider (ISP) forums. DFS is necessary for ISP forums since 
these forums often require traversing an ad page in order to get to the message page 
(typically, the ad pages have a link to the actual message page). Figure 4.4 illustrates 
how the BFS and DFS are performed for each forum type. Only the colored pages 
are fetched while the number indicates the order in which the pages are traversed by 
the crawler. 
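
The difference between the two traversal strategies can be sketched with a single frontier that is used as a queue for BFS and as a stack for DFS. The link graph and page names below are hypothetical, and the sketch ignores the URL token and level rules discussed earlier.

```python
from collections import deque


def traverse(start, links, strategy="bfs"):
    """Return pages in visit order; `links` maps a page to its out-links."""
    frontier = deque([start])
    seen = set()
    order = []
    while frontier:
        # BFS takes the oldest entry, DFS takes the newest one.
        page = frontier.popleft() if strategy == "bfs" else frontier.pop()
        if page in seen:
            continue
        seen.add(page)
        order.append(page)
        children = links.get(page, [])
        if strategy == "dfs":
            # Push in reverse so the first out-link is visited first.
            children = reversed(children)
        frontier.extend(child for child in children if child not in seen)
    return order


# Hypothetical ISP-style forum: each ad page leads to one message page.
links = {"board": ["ad1", "ad2"], "ad1": ["msg1"], "ad2": ["msg2"]}
print(traverse("board", links, "bfs"))  # ['board', 'ad1', 'ad2', 'msg1', 'msg2']
print(traverse("board", links, "dfs"))  # ['board', 'ad1', 'msg1', 'ad2', 'msg2']
```

With DFS, the crawler reaches each message page immediately after its ad page, which is the behavior the ISP forums call for.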
Fig. 4.4 URL traversal strategies 




Fig. 4.5 Spidering process 



5.2.3 Wrapper Generation 

Forums are dynamic archives that keep historical messages. It is beneficial to only 
spider newly posted content when updating the collection. This is achieved by gener- 
ating wrappers that can parse web forum board and thread pages (Glance et al. 2005). 
Board pages tell us when each thread was last updated with new messages. Using this 
information, one may respider only those thread pages containing new postings. 
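
As a sketch of how wrapper output can drive incremental respidering, assume a board-page wrapper has already produced (thread URL, last-updated time) pairs. The function name and the sample rows are hypothetical.

```python
from datetime import datetime


def threads_to_respider(board_rows, last_crawl):
    """Return URLs of threads updated after the previous crawl.

    board_rows: (thread_url, last_updated) pairs, as a board-page wrapper
    might extract them; last_updated and last_crawl are datetime objects.
    """
    return [url for url, last_updated in board_rows if last_updated > last_crawl]


rows = [
    ("showthread.php?t=101", datetime(2006, 12, 1, 9, 30)),
    ("showthread.php?t=102", datetime(2007, 1, 15, 18, 5)),
    ("showthread.php?t=103", datetime(2007, 2, 2, 7, 45)),
]
print(threads_to_respider(rows, last_crawl=datetime(2007, 1, 1)))
```

Only the two threads updated after the previous crawl date are returned, so older threads are not fetched again.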



5.3 Forum Spidering 

Figure 4.5 shows the spidering process. The incremental crawler fetches only new 
and updated threads and messages. A log file is sent to the recall improvement com- 
ponent. The log shows the spidering status of each URL. A parser is used to deter- 
mine the overall status for each URL (e.g., “download complete” and “connection 
timed out”). The parsed log is sent to the log analyzer which evaluates all files that 
were not downloaded. It determines whether the URLs should be respidered. 
Log 

[2006-11-8 13:49:53] HTTP7: Host news.stcom.net connected. Waiting for http://news.stcom.net/file=viewtopic&t=2121. 
[2006-11-8 13:50:10] HTTP0: 31550 bytes of http://www.gmaas.com/vb/showthread.php?t=21476 
[2006-11-8 13:50:12] HTTP0: 62750 bytes of http://www.gmaas.com/vb/showthread.php?t=21476 
[2006-11-8 13:50:12] HTTP0: Download complete. Status: 200 OK. 
[2006-11-8 13:50:25] HTTP7: Connection Timed out. 

Fig. 4.6 Example log and parsed log entries 

Figure 4.6 shows sample entries from the original and parsed log. The original 
log file shows the download status for each file (URL). The parsed log shows the 
overall status as well as the reason for download failure (in the case of 
undownloaded files). Blue-colored entries relate to downloaded files while red-colored 
entries relate to undownloaded files. The log analyzer determines the appropriate 
course of action based on the cause of failure. “File Not Found” URLs are removed 
(not added to the respidering list) while “Connection Timed Out” URLs are respidered. 
The recall improvement phase also checks the file sizes of collected web pages for 
partial/incomplete downloads. Multimedia files are occasionally downloaded 
manually, particularly larger video files that may otherwise time out. 
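
The log analyzer's decision rule can be sketched as follows. The status strings come from the text; the tuple layout of the parsed entries, the "review" fallback for unknown statuses, and the attempt cap are assumptions for illustration (the default of 2 mirrors the n = 2 reported in the evaluation).

```python
def respider_decision(status):
    """Map a parsed-log status to an action for the URL."""
    status = status.strip().lower()
    if status == "download complete":
        return "keep"
    if status == "connection timed out":
        return "respider"
    if status == "file not found":
        return "drop"
    return "review"  # assumption: unknown statuses are left for inspection


def build_respider_list(parsed_entries, max_attempts=2):
    """Select URLs to refetch from (status, url, attempts) tuples."""
    return [
        url
        for status, url, attempts in parsed_entries
        if respider_decision(status) == "respider" and attempts < max_attempts
    ]
```

Capping the number of attempts keeps the recall improvement step from placing an excessive load on the forum servers.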



5.4 Forum Storage and Analysis 

The forum storage and analysis phase consists of statistics generation and duplicate 
multimedia removal components. 

5.4.1 Statistics Generation 

Once files have been collected, they must be stored and analyzed. The statistics 
consist of four major categories: 

• Indexable files: HTML, Word, PDF, Text, Excel, PowerPoint, XML, and Dynamic 
files (e.g., PHP, ASP, JSP). 

• Multimedia files: Image, Audio, and Video files. 

• Archive files: RAR and ZIP. 

• Nonstandard files: unrecognized file types. 
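
A minimal sketch of statistics generation over these four categories is shown below. The extension-to-category mapping is an assumed, partial list chosen to match the categories above, and the function names are hypothetical. Extension-based categorization is only a first pass, since file extensions in these collections are not always reliable.

```python
import os
from collections import Counter

# Assumed, partial mapping from file extension to category.
CATEGORY_BY_EXTENSION = {
    ".html": "indexable", ".htm": "indexable", ".doc": "indexable",
    ".pdf": "indexable", ".txt": "indexable", ".xls": "indexable",
    ".ppt": "indexable", ".xml": "indexable", ".php": "indexable",
    ".asp": "indexable", ".jsp": "indexable",
    ".jpg": "multimedia", ".mp3": "multimedia", ".wmv": "multimedia",
    ".rar": "archive", ".zip": "archive",
}


def categorize(filename):
    """Return the category for a file name; unknown types are nonstandard."""
    extension = os.path.splitext(filename)[1].lower()
    return CATEGORY_BY_EXTENSION.get(extension, "nonstandard")


def collection_statistics(files):
    """Summarize (filename, size_in_bytes) pairs as {category: (count, bytes)}."""
    counts, volumes = Counter(), Counter()
    for name, size in files:
        category = categorize(name)
        counts[category] += 1
        volumes[category] += size
    return {category: (counts[category], volumes[category]) for category in counts}
```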





Parsed Log 

Connection timed out: http://news.stcom.net/file=viewtopic&t=2121. 
Download Complete: http://www.gmaas.com/vb/showthread.php?t=21476 

Fig. 4.7 Dark Web forum crawling system interface 



5.4.2 Duplicate Multimedia Removal 

Dark Web forums often share multimedia files, but the names of those files may be 
changed. Moreover, some multimedia files are given the extensions of other file 
types, and vice versa. For example, an HTML file may be named with a “.jpg” extension. 
Therefore, simply relying on file names results in inaccurate multimedia file 
statistics. We use an open-source duplicate multimedia removal software tool that 
identifies multimedia files by the metadata encoded in the file, instead of by their 
suffixes (file extensions). It can therefore more accurately differentiate multimedia 
files from other types of files. The tool detects duplicates by comparing files based 
on their MD5 values, which are the same for duplicate video files collected from 
various Internet sources. MD5 (Message-Digest algorithm 5) is a widely used 
cryptographic hash function with a 128-bit hash value. 
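
The MD5-based comparison can be sketched with Python's standard `hashlib` module. This is not the open-source tool used in the project, whose name the text does not give; the function names here are hypothetical, and the sketch covers only the hash comparison, not the metadata-based type identification.

```python
import hashlib
from collections import defaultdict


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_duplicates(paths):
    """Group files with identical content, regardless of name or extension."""
    groups = defaultdict(list)
    for path in paths:
        groups[md5_of_file(path)].append(path)
    return [sorted(group) for group in groups.values() if len(group) > 1]
```

Because the digest depends only on file content, two copies of the same video are grouped together even if one has been renamed or given a different extension.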



5.5 Dark Web Forum Crawling System Interface 

Figure 4.7 shows the interface for the proposed Dark Web forum spidering system. 
The interface has four major components. The “Forums” panel in the top left shows 
the spidering queue in a table that also provides information such as the forum 
name, URL, region, when it was last spidered, and whether the forum is still active. 
The “Spidering Status” panel in the top right corner displays information about the 
percentage of board, subboard, and thread pages collected for the current forum 
being spidered. The “Forum Statistics” panel in the bottom left shows the quantity 
and size of the various file types collected for each forum, using tables, pie charts, 
and parallel coordinates. The “Forum Profile” panel in the bottom right shows each 
forum’s membership information and forum spidering parameters, including the 
number of crawlers, URL ordering technique (i.e., BFS or DFS), and URL ordering 
features (e.g., URL tokens and keywords) used to control the crawl space. 



6 Evaluation 

We conducted two experiments to evaluate our system. The first experiment involved 
assessing the effectiveness of our human-assisted accessibility mechanism. Raghavan 
and Garcia-Molina (2001) noted that accessibility is the most important evaluation 
criterion for hidden web research. We describe how effectively we were able to 
access Dark Web forums in our collection efforts using the human-assisted approach 
in comparison with standard spidering without any accessibility mechanism. 

The second experiment entailed evaluating the proposed incremental spidering 
approach that uses recall improvement as a collection updating procedure. We per- 
formed an evaluation of the effectiveness of periodic crawling as compared to stan- 
dard incremental crawling and our incremental crawler which uses iterative recall 
improvement for Dark Web forum collection updating. 



6.1 Forum Accessibility Experiment 

Table 4.2 below presents results on our ability to access Dark Web forums with and 
without a human-assisted accessibility mechanism. Using the human-assisted acces- 
sibility approach, we were able to access over 82% of Dark Web forums hosted by 
various Internet service providers and virtually all of the attempted stand-alone 
forums. The overall results (over 91% accessibility) indicate that the use of a human- 
assisted accessibility mechanism provided good results for Dark Web forums. 



Table 4.2 Dark Web forum accessibility statistics 

                      Human-assisted accessibility       Standard spidering 
                      Hosted   Stand-alone   Total       Hosted   Stand-alone   Total 
                      forums   forums        forums      forums   forums        forums 
Total attempted       52       67            119         52       67            119 
Accessed/collected    43       66            109         25       56            71 
Inaccessible          9        1             10          27       11            48 
% Collected           82.69    98.51         91.60       48.08    83.58         59.66 
Table 4.3 Dark Web forum accessibility statistics: human-assisted accessibility 
versus standard spidering (rows: hosted forums, total forums) 

Fig. 4.8 Number of web pages in test bed across 3 months/iterations 



In contrast, standard spidering without any accessibility mechanism was able to 
access and collect only 59.66% of the forums. The biggest impact of the 
accessibility approach occurred on the hosted forums, where the lack of 
human-assisted accessibility reduced the number of forums collected by 18 
(about 34% of the hosted forums attempted). 
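
The percentages discussed here follow directly from the counts in Table 4.2. The short sketch below recomputes them; the variable and function names are chosen for illustration.

```python
def percent_collected(accessed, attempted):
    """Percentage of attempted forums that were accessed, to two decimals."""
    return round(100.0 * accessed / attempted, 2)


# (accessed, attempted) counts copied from Table 4.2.
human_assisted = {"hosted": (43, 52), "stand-alone": (66, 67), "total": (109, 119)}
standard = {"hosted": (25, 52), "stand-alone": (56, 67), "total": (71, 119)}

for label in ("hosted", "stand-alone", "total"):
    print(label,
          percent_collected(*human_assisted[label]),
          percent_collected(*standard[label]))

# Hosted forums lost without the human-assisted mechanism.
hosted_drop = human_assisted["hosted"][0] - standard["hosted"][0]
hosted_drop_percent = round(100.0 * hosted_drop / 52, 2)
print(hosted_drop, hosted_drop_percent)
```

The hosted-forum drop of 18 forums corresponds to the difference between the 82.69% and 48.08% collection rates.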

Table 4.3 shows the p values for the pairwise t tests conducted to assess the 
improved access performance of the human-assisted accessibility mechanism as 
compared to a standard spidering scheme devoid of any special accessibility method. 
The improved performance was statistically significant at alpha = 0.01 for total per- 
formance as well as both forum types. 



6.2 Forum Collection Update Experiment 

In order to evaluate the effectiveness of the proposed incremental crawling with 
recall improvement approach (referred to as incremental + RI) for collection updat- 
ing, we conducted a simulated experiment in which 40 Dark Web forums were spi- 
dered three times over a 3-month period between December 2006 and February 
2007. Figure 4.8 shows the number of cumulative web pages and the amount of new 
pages appearing in the 40 test bed forums across the 3-month period. There were 
approximately 128,000 unique web pages in the test bed, which were used as 
the gold standard for precision, recall, and F-measure computation. We collected the 
pages on a monthly basis (a total of three iterations) using a periodic, incremental, 
Table 4.4 Macrolevel results for different update procedures 

Update procedure    Precision   Recall   F-measure   Time (min) 
Periodic            74.32       69.03    71.58       6,101 
Incremental         57.80       53.69    55.67       4,855 
Incremental + RI    79.59       74.74    77.09       5,758 

Fig. 4.9 Results by iteration for various collection update procedures 
(legend: Periodic, Incremental, Incremental + RI) 



and incremental + RI collection update procedure. The periodic crawler collected all 
pages in each iteration (the cumulative amounts in Fig. 4.8) while the incremental 
crawler only collected the new pages for each iteration (the iterative amounts in 
Fig. 4.8). The advantage of periodic crawling is the ability to ascertain multiple ver- 
sions of a page, which can improve the likelihood of gathering pages uncollected in 
the previous round at the expense of collection time and server congestion. The 
incremental + RI crawler also collected only the new pages but used a recall mechanism 
that allowed improperly retrieved pages to be refetched n times. The recall 
improvement phase, which identifies uncollected pages based on their spidering 
status and file size, is intended to retrieve uncollected pages in an efficient manner 
(i.e., without putting excessive burden on the forum servers). Consequently, a value 
of n = 2 was utilized since we have found that excessive attempts (i.e., larger values 
of n) typically decrease performance due to server congestion. 

Performance was evaluated using precision, recall, and F-measure. Precision 
was defined as the percentage of pages downloaded that were correctly collected. 
Correctly collected pages included all relevant pages completely downloaded. 
Incorrect pages were those that were partial/incomplete or irrelevant. Recall was 
defined as the percentage of relevant pages collected. The F-measure is the 
harmonic mean of precision and recall. 
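
These measures can be written out as a short sketch. It assumes the F-measure is the balanced harmonic mean of precision and recall, which reproduces the F-measure column of Table 4.4 from the reported precision and recall values; the function names are chosen for illustration.

```python
def precision(correct, downloaded):
    """Percentage of downloaded pages that were correctly collected."""
    return 100.0 * correct / downloaded


def recall(correct, relevant):
    """Percentage of relevant pages that were correctly collected."""
    return 100.0 * correct / relevant


def f_measure(p, r):
    """Harmonic mean of precision and recall (both given in percent)."""
    return 2.0 * p * r / (p + r)


# Precision and recall values copied from Table 4.4.
for name, p, r in [("Periodic", 74.32, 69.03),
                   ("Incremental", 57.80, 53.69),
                   ("Incremental + RI", 79.59, 74.74)]:
    print(name, round(f_measure(p, r), 2))
```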

Table 4.4 shows the experimental results for the three collection procedures. The 
incremental + RI method achieved the highest precision, recall, and F-measure in a 
more efficient manner than the periodic approach. The incremental update without 
recall improvement was the most efficient timewise; however, it only had an 
F-measure of roughly 55%. The results suggest that Dark Web forums require the 
use of a spidering strategy that entails multiple attempts to fetch uncollected pages. 

Figure 4.9 shows the overall F-measure for the three collection updating proce- 
dures after each spidering iteration. The diagram exemplifies the impact of making 
Table 4.5 Dark Web forum collection statistics 

                 Hosted forums   Stand-alone forums   Total forums 
Middle East      21              50                   71 
Latin America    6               3                    9 
US domestic      16              13                   29 
Total            43              66                   109 



multiple attempts to collect unfetched pages. We can see that the overall perfor- 
mance of periodic crawling improves dramatically during the second and third itera- 
tions since many of the previously uncollected web pages are gathered. Since the 
incremental + RI method retrieves such pages immediately, it maintains a consistently 
higher level of performance as compared to the other two methods. 



6.3 Forum Collection Statistics 

We used our spidering system for collection of Dark Web Forums in three regions 
(USA, Middle East, Latin America). The spider was initially run incrementally for 
a 20-month period between April 2005 and December 2006. The spider collected 
indexable, multimedia, archive (e.g., .zip and .rar), and nonstandard files (e.g., those 
with unknown/unrecognized file extensions). 

Table 4.5 shows the number of forums collected per region. The collection con- 
sists of stand-alone and hosted forums. In general, the Middle Eastern groups tend 
to make greater use of stand-alone forums while the US domestic forums are more 
evenly distributed between hosted and stand-alone forums. 

Table 4.6 shows the detailed collection statistics categorized by file type. Our 
system was able to collect a rich assortment of indexable and multimedia files. 

It is interesting to note the large quantities of dynamic and multimedia files. 
Static HTML files, which were predominant on the Internet 10 years ago, have a 
minimal amount of usage in the Dark Web forums. Dynamic files outnumber static 
HTML files by a ratio of roughly 10:1 while multimedia files (particularly images) are also 
present more often. This is partially attributable to the use of various forum software 
packages that generate dynamic thread pages (typically .php files). 
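
The ratios mentioned above can be checked against the file counts in Table 4.6. The sketch below only repeats that arithmetic; the variable names are chosen for illustration.

```python
# File counts copied from Table 4.6.
html_files = 283_578
dynamic_files = 2_715_354
multimedia_files = 433_749

dynamic_to_static = dynamic_files / html_files
multimedia_to_static = multimedia_files / html_files

print(round(dynamic_to_static, 1))     # 9.6, i.e., close to 10:1
print(round(multimedia_to_static, 2))  # 1.53
```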

Table 4.7 lists sample forums incorporated into our system after additional 
rounds of spidering. Forum “Al-Firdaws” is a general forum but has subsections 
containing discussions of radical Islamic ideologies and supporting Salafi-Jihadi 
organizations. Forum “Alokab” is dedicated to Islamic theology with some radical 
content. Forum “Hawaaworld” is dedicated to Muslim women. Some of its mem- 
bers have shown their sympathy to certain radical groups. Among all seven forums, 
“Hawaaworld,” “Montada,” and “Alsayra” have many more registered members 
than the other four forums. Some forums are extremely popular and have close to a 
million messages. 
Table 4.6 Dark Web forum collection file statistics 

                       # of files    Volume (bytes) 
Indexable files        3,001,742     140,878,063,124 
  HTML files           283,578       2,942,658,681 
  Word files           2,108         46,649,107 
  PDF files            16            8,168,345 
  Dynamic files        2,715,354     137,178,574,841 
  Text files           657           2,249,471,937 
  Excel files          1             177,152 
  PowerPoint files     2             528,834 
  XML files            26            466,706 
Multimedia files       433,749       25,833,258,770 
  Image files          422,155       8,554,125,848 
  Audio files          5,479         3,664,642,638 
  Video files          6,115         13,614,490,284 
Archive files          801           621,721,139 
Nonstandard files      443,244       17,303,588,746 
Total                  3,878,735     185,017,574,960 



Table 4.7 Sample Dark Web forum statistics 

Name              Time span                Number of   Number of   Number of 
                                           members     threads     messages 
Alokab            04/10/2005-03/19/2008    1,232       3,699       30,480 
Al-Firdaws        01/02/2005-12/06/2007    2,189       9,359       39,775 
Montada           09/28/2000-07/01/2007    31,654      93,548      866,693 
Hdrmut            11/26/2000-05/18/2008    1,707       9,030       45,937 
Alsayra           04/05/2001-06/03/2008    39,230      42,329      348,933 
Hawaaworld        03/27/200-07/01/2008     59,842      20,278      975,695 
Islamic Network   06/09/2004-05/07/2008    1,578       12,003      87,769 



All forums listed are in Arabic with the exception of Islamic Network, which is in English 



7 Conclusions and Future Directions 

In this chapter, we developed a focused crawler for collecting Dark Web forums. We 
used a human-assisted accessibility mechanism to access identified forums with a 
success rate of over 90%. Our crawler uses language-independent features, including 
URL tokens, anchor text, and level features, in order to allow effective collection of 
content in multiple languages. It also uses forum software-specific traversal strate- 
gies and wrappers to support incremental crawling. The system uses an incremental 
crawling approach coupled with a recall improvement mechanism that continually 
respiders uncollected pages. In a head-to-head comparison, this update approach 
outperformed a standard incremental update strategy in precision and recall, and the 
traditional periodic update method in precision, recall, and computation time. 

The system has been able to maintain up-to-date collections of 109 forums in 
multiple languages from three regions: US domestic supremacist, Middle Eastern 
extremist, and Latin American groups. We believe that the proposed forum crawling system 
allows important entry to Dark Web forums which facilitates better accessibility for 
the analysis of these online communities. The collection of such content has signifi- 
cant academic and scientific value for intelligence and security informatics as well 
as various other research communities interested in analyzing the social character- 
istics of Dark Web forums. 

We have identified several important directions for future research. We plan to 
improve the Dark Web forum accessibility mechanism in order to attain higher 
access rates. We also plan to expand our collection efforts to include weblogs 
and chatting log archives. Additionally, we intend to evaluate the effectiveness of 
multimedia categorization techniques to enhance our ability to collect relevant 
image and video content. 
Chapter 5 

Link and Content Analysis 



1 Introduction 

The Internet has evolved to become a global platform through which anyone can 
conveniently disseminate, share, and communicate ideas. Despite many advantages, 
misuse of the Internet has become ever more serious. Terrorist organizations, 
extremist groups, hate groups, and racial supremacy groups are using the Web to 
promote their ideology, to facilitate internal communications, to attack their ene- 
mies, and to conduct criminal activities. There have been warnings that terrorists 
may launch attacks on such critical infrastructure as major e-commerce sites and 
governmental networks (Gellman 2002). Insurgents in Iraq have posted Web mes- 
sages asking for munitions, financial support, and volunteers (Blakemore 2004). It 
therefore has become important to obtain intelligence from the Web that permits 
better understanding and analysis of terrorist and extremist groups. We define this 
reverse side of the Web as a “Dark Web,” the portion of the World Wide Web used 
to help achieve the sinister objectives of terrorists and extremists. 

Currently, intelligence from the Dark Web is scattered in diverse information 
repositories through which investigators need to browse manually to be aware of 
their content. Much of the information stored in search engine databases could be 
properly collected and analyzed for transformation into intelligence and knowledge 
that would enhance understanding of terrorists’ activities. However, search engines 
often overwhelm users by producing laundry lists of irrelevant results and creating 
information overload problems. Related but unfocused information makes it diffi- 
cult to obtain a comprehensive description of a terrorist group or a terrorism topic. 
Many Web resources contain information about terrorism, but a relatively small 
proportion comes from terrorist groups themselves, and data on the Web often are 
not persistent and may be misleading. Many terrorist Web sites do not use English, 
so investigators who do not know a site’s language may be unable to understand 
its content. 
In this chapter, we have addressed the aforementioned problems by proposing and 
implementing a semiautomated methodology for collecting and analyzing Dark Web 
information. Leveraging human preciseness and machine efficiency, the methodol- 
ogy consists of various steps, including collection, filtering, analysis, and visualiza- 
tion of Dark Web information. We used this comprehensive methodology to collect 
and analyze data from 39 Arabic terrorist Web sites and conducted an evaluation of 
the results. This research aimed to study to what extent the methodology can assist 
terrorism analysts in collecting and analyzing Dark Web information. From a broader 
perspective, this research contributes to the development of the new science of 
“Intelligence and Security Informatics (ISI),” the study of the use and development 
of advanced information technologies, systems, algorithms, and databases for national 
security-related applications through an integrated technological, organizational, and 
policy-based approach (Chen 2005; Strickland and Hunt 2005). We believe that many 
existing computer and information systems techniques need to be reexamined and 
adapted for this unique domain to create new insights and innovations. 

The rest of this chapter is structured as follows: Sect. 2 presents a review of ter- 
rorists’ use of information technologies to facilitate terrorism, information services 
for studying terrorism, and advanced techniques for collecting and analyzing terror- 
ism information. Section 3 describes a methodology for collecting and analyzing 
Dark Web information. Section 4 illustrates the use of the methodology in a case 
study of jihad on the Web (where “jihad” is an Islamic term referring to a holy war 
waged against enemies) and discusses the evaluation results. Section 5 concludes 
the study and discusses future directions. 



2 Literature Review 

2.1 Terrorists’ Use of the Web 

Recent studies have shown how terrorists use the Web to facilitate their activities. 
Tsfati and Weimann used the names of terrorist organizations to search six search 
engines and found 16 relevant sites in 1998 and 29 such sites in 2002 (Tsfati and 
Weimann 2002). Their analysis of site content revealed heavy use of the Web by 
terrorist organizations to share ideology, to provide news, and to justify use of vio- 
lence. Relying on open source information (e.g., court testimony, reports, Web 
sites), researchers at the Institute for Security Technology Studies identified five 
categories of terrorist use of the Web (Technical Analysis Group 2004): propaganda 
(to disseminate radical messages), recruitment and training (to encourage people to 
join the jihad and get online training), fundraising (to transfer funds, conduct credit 
card fraud, and other money laundering activities), communications (to provide 
instruction, resources, and support via e-mail, digital photographs, and chat ses- 
sion), and targeting (to conduct online surveillance and identify vulnerabilities of 
potential targets such as airports). Among these, using the Web as a propaganda tool 
has been widely observed. 
Identified by the US Government as a terrorist site, Alneda.com called itself the 
“Center for Islamic Studies and Research,” a bogus name, and provided information 
for al-Qaeda (Thomas 2003). To group members (insiders), terrorists use the Web to 
share motivational stories and descriptions of operations. To mass media and non- 
members (outsiders), they provide analysis and commentaries of recent events on 
their Web sites. For example, Azzam.com urged Muslims to travel to Pakistan and 
Afghanistan to fight the “Jewish-backed American Crusaders.” Qassam.net appealed 
for donations to purchase AK-47 rifles (Kelley 2002). Al-Qaeda and some humani- 
tarian relief agencies used the same bank accounts via www.explizit-islam.de 
(Thomas 2003). 

Terrorists also share ideologies on the Web that provide religious commentaries 
to legitimize their actions. Based on a study of 172 members participating in the 
global Salafi jihad, Sageman concluded that the Internet has created a concrete bond 
between individuals and a virtual religious community (Sageman 2004). His study 
reveals that the Web appeals to isolated individuals by easing loneliness through 
connections to people sharing some commonality. Such a virtual community offers a 
number of advantages to terrorists. It is no longer tied to any nation, fostering a priority 
of fighting against the far enemy (e.g., the USA) rather than the near enemy. 
Internet chat rooms tend to encourage extreme, abstract, but simplistic solutions, 
thus attracting most potential jihad recruits who are not Islamic scholars. The ano- 
nymity of Internet cafes also protects the identity of terrorists. However, Sageman 
does not consider the Internet to be a direct contact with jihad because devotion to 
jihad must be fostered by an intense period of face-to-face interaction. In addition, 
existing studies about terrorists’ use of the Web mostly use a manual approach to 
analyze voluminous data. Such an approach does not scale up to the rapid growth 
and frequent change of terrorists’ identities on the Web. 



2.2 Information Services for Studying Terrorism 

Despite the public nature of the Web, terrorists often try to prevent authorities from 
tracing their Web addresses and activities, which has prompted several information 
services to monitor the Web sites of militant Islamic groups and to provide access to 
translated versions of information posted there. The Jihad and Terrorism Project 
was developed by the Middle East Media Research Institute to bridge the language 
gap between the West and the Middle East by providing timely translations of 
Arabic, Farsi, and Hebrew documents (Middle East Media Research Institute 2004). 
The Project for the Research of Islamist Movements (www.e-prism.org) studies 
radical Islam and Islamist movements, focusing primarily on Arabic sources. These 
projects provide access to an array of information such as translated news stories, 
transcripts, video clips, and training documents produced by terrorists but fall short 
of supporting analysis and visualization of terrorist data from the Dark Web (Project 
for the Research of Islamist Movements 2004). 
2.3 Advanced Information Technologies 
for Combating Terrorism 

Since the 9/11 attacks, there has been increased interest in using information tech- 
nologies to counter terrorism. A study conducted by the US Defense Advanced 
Research Projects Agency shows that their collaboration, modeling, and analysis 
tools sped up analysis (Popp et al. 2004), but these tools were not tailored to collecting 
and analyzing Web information. Although new approaches to terrorist network 
analysis have been called for (Carley et al. 2001), existing efforts have remained 
mostly small scale; they have used manual analysis of a specific terrorist organiza- 
tion and did not include resources generated by terrorists in their native languages. 
For instance, Krebs manually collected data from English news releases after the 
9/11 attacks and studied the network surrounding the 19 hijackers (Krebs 2001). 
Although automated social network analysis techniques have been proposed to ana- 
lyze and portray criminal networks, it is not clear whether the techniques are appli- 
cable to the mostly unstructured data in terrorist Web sites that contain textual and 
multimedia data (Xu and Chen 2005). Their use of structured data in a police depart- 
ment database also does not help understand terrorist Web sites. Other advanced 
information technologies having potential to help analyze terrorist data on the Web 
include information visualization and Web mining. 

Information visualization technologies have been used in many domains (Zhu 
and Chen 2005) such as criminal analysis (Chung et al. 2005a) and business stake- 
holder analysis (Chung 2007). For example, multidimensional scaling (MDS) algo- 
rithms consist of a family of techniques that portray a data structure in a spatial 
fashion, where the coordinates of data points are calculated by a dimensionality 
reduction procedure (Young 1987). MDS has been used in many different applica- 
tions. Chung and his colleagues developed a new browsing method based on MDS 
to depict the competitive landscape of businesses on the Web (Chung et al. 2005b). 
He and Hui applied MDS to display author cluster maps in their author co-citation 
analysis (He and Hui 2002). Eom and Farris applied MDS to author co-citation in 
decision support systems (DSS) literature from 1971 through 1990 in order to find 
contributing fields to DSS (Eom and Farris 1996). Kealy applied MDS to studying 
changes in knowledge maps of groups over time to determine the influence of a 
computer-based collaborative learning environment on conceptual understanding 
(Kealy 2001). Although much has been done in different domains to visualize rela- 
tionships of objects using MDS, no attempts to apply it to discovering terrorists’ use 
of the Web have been found. 

Web mining is the use of data mining techniques to automatically discover and 
extract information from Web documents and services (Chen and Chau 2004; 
Etzioni 1996). Chen et al. (2001) showed that the approach of integrating meta- 
searching with textual clustering tools achieved high precision in searching the 
Web. Web page classification, a process of automatically assigning Web pages into 
predefined categories, can be used to assign pages into meaningful classes (Mladenic 
1998). Web page clustering, a process of identifying naturally occurring subgroups 
among a set of Web pages, can be used to discover trends and patterns within a large 
number of pages (Chen et al. 1996). Although a number of Web mining technolo- 
gies exist (e.g., Chen and Chau 2004; Last et al. 2006), there has not yet been a 
comprehensive methodology to address problems of collecting and analyzing ter- 
rorist data on the Web. Unfortunately, existing frameworks using data and text min- 
ing techniques (e.g., Nasukawa and Nagano 2001; Trybula 1999) do not address 
issues specific to the Dark Web. 

To our knowledge, few studies have used advanced Web and data mining tech- 
nologies to collect and analyze terrorist information on the Web, though these tech- 
nologies have been widely applied in such other domains as business and scientific 
research (e.g., Chung et al. 2004; Marshall et al. 2004). New approaches to collect- 
ing and analyzing terrorist information on the Web are needed. 



3 A Methodology for Collecting and Analyzing 
Dark Web Information 

3.1 The Methodology 

To address threats from the wide range of information sources that terrorists and 
extremists use to spread their ideas and to conduct destructive activities, we have 
proposed a semiautomated methodology integrating various information collection 
and analysis techniques and human domain knowledge. Figure 5.1 shows the meth- 
odology aiming to effectively assist human investigators to obtain Dark Web intel- 
ligence using information sources, collection methods, filtering, and analysis. 

• Information sources consist of a wide range of providers of terrorist or terrorism 
information on the Web. Some of these are readily accessible (e.g., search 
engines), while some, like terrorism incident databases and Web sites developed 
and maintained by terrorists and their supporters, can only be reached with the 
help of domain experts. 

• Collection methods make possible automatic searching, browsing, and harvesting
of information from identified sources. Domain spidering starts with a set of
relevant seed URLs and relies on an automatic Web page collection program, often
called a spider or crawler, to harvest Web pages linked to the seed URLs.
Back-link search, supported by some search engines such as Google
(www.google.com) and AltaVista (www.altavista.com, acquired by Overture, which
was in turn acquired by Yahoo! in 2003), allows searching of Web pages that have
hyperlinks pointing to a target Web domain or page. It helps investigators trace
activities of terrorist supporters and sympathizers, whose Web pages often
reference terrorist sites (e.g., glorify martyrs’ actions, show concurrence with
terrorist attacks). Group/personal profile search, exemplified by major Web
portals such as Yahoo! (members.yahoo.com) and MSN (groups.msn.com), reveals the
profiles of groups or individuals who share the same interests. Terrorists and
their supporters may put “hot links” in their profiles, which allow
investigators to discover hidden linkages. Meta-searching uses related keywords
as input to query multiple search engines, from which investigators or automated
programs can collate top-ranked results and filter out duplicates to obtain
highly pertinent URLs of terrorist Web sites. With careful formulation of search
terms and appropriate linguistic knowledge, they can obtain highly relevant
results. For example, searching the Arabic name of “Osama bin-Laden”
(أسامة بن لادن) in multiple search engines returns mixed results about terrorist
news articles and terrorist Web sites, while augmenting “Osama bin-Laden” with
the keyword “Sheikh” (the head of tribe or leader in Arabic), which is
frequently used by al-Qaeda to refer to bin-Laden, can give more relevant
terrorist and supporter Web sites. Downloading from Internet archives and forums
exploits the temporal dimension of Web information. For instance, the Internet
Archive (www.archive.org) offers access to historical snapshots of Web sites.
Usenet discussion forums provide a wealth of textual communication that can be
mined for hidden patterns over time.

Fig. 5.1 A methodology for collecting and analyzing Dark Web information

• Filtering involves sifting through collected information and removing irrelevant
results, a task that requires both domain knowledge and linguistic knowledge.
Domain knowledge refers to knowledge about terrorist groups, their
relationships with other terrorist and supporter groups, their presence on and usage
of the Web, as well as their histories, activities, and missions. Linguistic knowl- 
edge deals with terms, slogans, and other textual and symbolic clues in the native 
languages of the terrorist groups. Filtering can be automatic or manual, depend- 
ing on requirements for efficiency of process and precision of the results. 
Typically, manual filtering achieves high precision but is less efficient and relies 
on domain experts who have had years of experience in the field. Automatic fil- 
tering is very efficient as it often uses computers and machine learning to process 
large amounts of data, but the results are less precise. Investigators can obtain 
high-quality data for analysis from filtered repositories. 

• Analysis provides insights into data and helps investigators identify trends and 
verify conjectures. Several functions support these analytical tasks. Indexing 
relates textual terms to individual Web pages, thereby supporting precise search- 
ing of the pages. Extraction identifies meaningful entities such as terrorist names, 
frequently used slogans, and suspicious terms. Classification finds common 
properties among entities and assigns them to predefined categories to help 
investigators predict trends of terrorist activities. Clustering organizes entities 
into naturally occurring groups and helps to identify similar terrorist groups and 
their supporters. Visualization presents voluminous data in a format perceivable 
by human eyes, so investigators can picture the relationship within a network 
organization of terrorist groups and can recognize their underlying structure. 
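
The meta-searching step described in the list above (collating top-ranked results from several search engines and filtering out duplicates) can be sketched as follows. This is only a minimal illustration under stated assumptions, not the program used in this study: the engine names, the example URLs, and the helper functions normalize_url and collate_results are hypothetical, and the rule that URLs returned by more engines rank first is our own simplification.

```python
from urllib.parse import urlsplit


def normalize_url(url):
    """Reduce a URL to a comparable key (hypothetical normalization rule)."""
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")


def collate_results(results_by_engine, top_k=50):
    """Merge the top_k results of each engine and drop duplicate URLs.

    URLs returned by more engines come first; ties keep first-seen order.
    """
    votes = {}  # normalized key -> number of engines that returned it
    first = {}  # normalized key -> first original URL seen for that key
    for urls in results_by_engine.values():
        seen_here = set()
        for url in urls[:top_k]:
            key = normalize_url(url)
            if key in seen_here:
                continue  # count each engine at most once per URL
            seen_here.add(key)
            votes[key] = votes.get(key, 0) + 1
            first.setdefault(key, url)
    ordered = sorted(first, key=lambda k: -votes[k])  # sorted() is stable
    return [first[k] for k in ordered]


# Hypothetical result lists from two engines (placeholder URLs).
results = {
    "engine_a": ["http://www.example.org/", "http://example.net/forum"],
    "engine_b": ["http://example.org", "http://example.com/news"],
}
merged = collate_results(results)
```

The resulting list would still need the manual or automatic filtering described above before analysis; the sketch only removes exact duplicates after normalization.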



3.2 Discussion of the Methodology 

Although the Internet has been publicly available since the 1990s, the Dark Web 
emerged only in recent years. A lack of useful methodology designed for Dark Web 
data collection and analysis has limited the capability to fight against terrorism. As 
discussed above, the proposed methodology has incorporated various data and Web 
mining technologies while still allowing human domain knowledge to guide their 
application. Its semiautomated nature combines machine efficiency with the advan- 
tages of human precision, a useful complement to computers that usually fail to 
detect deception and ambiguity on the Dark Web. Its coverage of wide varieties of 
data sources and techniques ensures a comprehensive Dark Web data collection, a 
challenge often faced by terrorism and intelligence analysts. Therefore, the method- 
ology and its integration and application of data and Web mining technologies to 
Dark Web analysis are novel contributions to the ISI research. 



4 Jihad on the Web: A Case Study 

To demonstrate the value and usability of our methodology, we have applied it to 
collecting and analyzing the use of the Web for “jihad,” an Islamic term referring to 
a holy war waged against enemies as a religious duty. Believers contend that those 
who die in jihad become martyrs and are guaranteed a place in paradise. In recent 
decades, the concept of jihad has been used as an ideological weapon to combat
Western influences and secular governments and to establish an ideal Islamic
society (Encyclopedia Britannica Online 2007). Jihad supporters are closely related 
to terrorist groups while maintaining anonymity using the Web. For example, prior
to the 9/11 attacks, al-Qaeda members sent each other thousands of messages in a
password-protected section of an extreme Islamic Web site (Anti-Defamation League 
2002). Terrorist groups such as Hamas, Hezbollah, and Palestinian Islamic Jihad 
also use Web sites as propaganda tools. We describe the steps of applying the meth- 
odology as follows (see Fig. 5.1). The data described below were collected in 2004. 



4.1 Application of the Methodology 

4.1.1 Collection 

To collect data, we first identified four suspicious URLs through Web searching, 
consulting published terrorism reports, and performing personal profile searches
on Yahoo! (For example, we searched “hezbollah” in Google where we found its 
URL among the top-ranked results.) These URLs are Palestinian Islamic Jihad (PIJ) 
(www.qudsway.com), Hezbollah (www.hezbollah.org), the military wing of Hamas 
(www.ezzedeen.net), and an Arabic Web site with a pro-jihad forum (www.al-imam.net).
A 2003 US Department of State report confirmed PIJ, Hezbollah, and Hamas
to be terrorist or terrorist-affiliated groups (Department of State 2003). Though 
Al-Imam.net is not classified as a terrorist organization, it contains pro-jihad forums 
in which messages and links to terrorist Web sites are posted. We then used the 
back-link search function of Google to obtain several hundred URLs that point to 
the four suspicious URLs. As Dark Web information can be scattered across many
different sources and can change quickly over time, the several methods used to
identify the four initial URLs enabled us to cover a broader scope and more timely
content than relying only on published reports (e.g., US Department of State’s
annual report). While different initial URLs and different times of data collection 
could affect the content of the data collected, we believe these four URLs are repre- 
sentative of the Dark Web. It would be an interesting future direction to study the 
extent to which data collection affects the quality of analysis results. 



4.1.2 Filtering 

We conducted two rounds of filtering. First, we manually filtered out unrelated sites, 
such as news or governmental Web sites that report or discuss only terrorist activi- 
ties, religious Web sites with no reference to jihad or violence, and political Web 
sites where there is no mention or approval of terrorist activities. We retained Web 
sites of terrorist organizations, those of terrorist leaders, and those that praise terror- 
ists or their actions. Forty-six sites remained after this round of filtering. 

Second, with the help of a native Arabic speaker (who is not a terrorism expert), 
we manually added 14 terrorist and supporter sites identified by querying Google 
with the keywords (in Arabic) that we had found in the terrorist and supporter sites.
Such keywords included the leaders’ and organizations’ names in Arabic
(“mojahedin iran,” “markaz dawa,” etc.). To limit the scope of
analysis, we considered only the top 50 results returned from the search engine in
each query search. In addition, we manually removed 21 sites from the set of all 
sites obtained based on their relevance to the domain. This round of filtering and 
refining resulted in 39 Arabic Web sites - 24 terrorist sites and 15 supporter sites. 



4.1.3 Analysis 



We performed clustering, classification, and visualization on the 94,326 Web pages 
collected by crawling the 39 terrorist and supporter sites using an exhaustive 
breadth-first search spidering program (with a maximum depth of 10 levels). The 
first analysis task we performed was clustering, in which we considered as input the 
46 Web sites identified from the first round of filtering (see paragraph 1 of Sect. 4.1.2). 
The clustering involves calculating a similarity between each pair of Web sites in 
our collection to uncover hidden Web communities. We define similarity to be a 
real-valued multivariable function of the number of hyperlinks in one Web site 
(“A”) pointing to another Web site (“B”), and the number of hyperlinks in the latter 
site (“B”) pointing to the former site (“A”). In addition, a hyperlink is weighted
in inverse relation to how deep it appears in the Web site hierarchy. For instance, a
hyperlink appearing on the homepage of a Web site is given a higher weight than 
hyperlinks appearing at a deeper level. Specifically, the similarity between Web 
sites “A” and “B” is calculated as follows: 



\[
\mathrm{Similarity}(A, B) = \sum_{\text{all links } L \text{ between } A \text{ and } B} \frac{1}{1 + lv(L)}
\]



where lv(L) is the level of link L in the Web site hierarchy, with homepage as level 
0 and the level increased by 1 with each level down in the hierarchy. Using these 
heuristics, a computer program automatically extracted hyperlinks on Web pages 
and calculated their similarities. 
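
The similarity calculation above can be illustrated with a short sketch. It assumes that hyperlinks have already been extracted as (source site, target site, level) records; the function name site_similarity, the record layout, and the example sites are hypothetical and are not taken from the program used in this study.

```python
from collections import defaultdict


def site_similarity(links):
    """Sum 1 / (1 + lv(L)) over all links L between each pair of sites.

    `links` is an iterable of (source_site, target_site, level) tuples,
    where level is the depth of the linking page (homepage = 0).
    Links in both directions contribute to the same unordered pair.
    """
    sim = defaultdict(float)
    for source, target, level in links:
        if source == target:
            continue  # links inside a single site are ignored
        pair = tuple(sorted((source, target)))  # A-B and B-A share one entry
        sim[pair] += 1.0 / (1 + level)
    return dict(sim)


# Hypothetical extracted links: (source, target, level of the linking page).
example_links = [
    ("siteA", "siteB", 0),  # link on A's homepage: weight 1
    ("siteB", "siteA", 1),  # link one level down in B: weight 1/2
    ("siteA", "siteC", 3),  # link three levels down in A: weight 1/4
]
similarities = site_similarity(example_links)
# similarities[("siteA", "siteB")] == 1.5
# similarities[("siteA", "siteC")] == 0.25
```

In this example, the two links between siteA and siteB appear near the top of their hierarchies and therefore contribute far more than the single deep link between siteA and siteC.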

In the second analysis task, we classified the sites by their affiliations with terror- 
ist groups, ideologies, and religions, and by their Web site attributes. Our native 
Arabic speaker manually identified the affiliations of all the Web sites according to 
their site content. Although we had the help of the Arabic speaker, the components 
of the methodology are generic enough to be applicable to other domains. The 
choice of this Arabic speaker, who is not a terrorism expert, also would not affect 
the results. Table 5.1 shows the details of the Web sites and their affiliations. 

In addition to using affiliations, we classified the sites by indicating how terror- 
ists and their supporters use the Web to facilitate their activities. From our literature 
review, we identified six types of terrorist use of the Web and 27 unique Web site 
attributes. Table 5.2 presents these attributes categorized under the six types. 



Table 5.1 Analysis of jihad terrorist groups and their supporters’ sites

No.  Name             URL(a)                Description(b)                               Terrorist group(c)  Religion

Terrorist groups’ Web sites (total: 24)

1.   Special Force    www.specialforce.net  Provides computer game replicating the       Hezbollah           Shi’a Muslim
                                            fighting scenes between Lebanese

     Alokab           www.alokab.com        Provides articles about the Salafi Sunni     al-Qaeda            Sunni Muslim
                                            ideology and the jihadist movement

     Alsakifah Forum  www.alsakifah.org     Provides educational services and a forum    al-Qaeda            Sunni Muslim
                                            dedicated to the discussion of the Salafi
                                            ideology

(b) The descriptions are obtained from the Web sites
(c) Descriptions of these terrorist groups appear in the US Department of State Report Patterns of Global Terrorism, 2002



Table 5.2 Categories of terrorist use of the Web and Web site attributes

Category                Attribute                         Description

Communications          E-mail                            Any listed e-mail address or feedback form
                        Telephone (including Web phone)   Telephone numbers of organization officials
                        Multimedia tools                  Video clips of bombings and other activities. Video,
                                                          sound recording, and games (e.g., leader’s messages
                                                          and instructions)
                        Online feedback form              Allow the user to give feedback or ask questions to the
                                                          Web site owners and maintainers
                        Documentation                     Report, book, letter, memo, and other resources provided
                                                          (e.g., in pdf, Word, Excel, other formats)

Fundraising             External aid mentioned            Other groups or governments supporting the organization
                        Fund transfer                     Fund transfer methods
                        Donation                          Donations under the form of direct bank deposits
                        Charity                           Donations to religious welfare organizations associated
                                                          with terrorist organization
                        Support groups                    Suborganizational structures charged with the fundraising
                                                          program
                        Others                            Other attributes belonging to this category

Sharing ideology        Mission                           The major goals of the organization (e.g., destruction of an
                                                          enemy state, liberation of occupied territories)
                        Doctrine                          The beliefs of the group (e.g., religious, communist,
                                                          extreme right)
                        Justification of the use          Ideology condones the use of violence to accomplish
                        of violence                       goals (e.g., suicide bombing)
                        Pinpointing enemies               Classifies others as either enemies or friends (e.g., USA is
                                                          enemy, Taliban regime is friendly)

Propaganda (insiders)   Slogans                           Short phrases with religious or ideological connotations
                        Dates                             Mentions dates in the history of the terrorist group such
                                                          as the date of a major attack
                        Martyr’s description              Lists the names of members who died in terrorism-related
                                                          operations or descriptions of the circumstances
                        Leader’s name(s)                  Terrorist groups leader(s) name as claimed by the Web site
                        Banner and seal                   Banner depicting representative figures, graphical
                                                          symbols, or seals of the organization
                        Narratives of operations          Provides narratives of the operations and attacks of the
                        and events                        group
                        Others                            Other attributes belonging to this category

Propaganda (outsiders)  Reference to media coverage       For example, the Web site criticizes Western media
                        of events                         coverage of events with explicit mention of outlets
                                                          such as CNN and CBS
                        News reporting                    Group’s own interpretation of events

Virtual community       Listserv                          Automatic mailing list server that broadcasts to everyone
                                                          on the list
                        Text chat room                    Virtual room where a chat session takes place. Text
                                                          messaging chat session such as ICQ
                        Message board                     Allows members to post and read messages online
                        Web ring                          A series of Web sites linked together in a ring that by
                                                          clicking through all of the sites in the ring the visitor
                                                          will eventually come back to the originating site

Fig. 5.2 Clustering and visualization of terrorist Web sites (the numbers refer to those appearing
in Table 5.1). Some Web sites in Table 5.1 do not appear in this figure because they were added after
the first-round filtering. Cluster labels shown in the figure: 1. Hizballah Cluster; 2. Palestinian
Cluster; 3. Al-Qaeda Cluster; 4. Caucasian Cluster; 5. Jihad Supporters; 6. Hizb-Ut-Tahrir Cluster



Following this coding scheme, the Arabic speaker manually read through all of the 
subject Web pages to record terrorist uses of the Web. Similar to that used in study- 
ing the openness of government Web sites (La Porte et al. 1999), our coding involved 
finding whether an attribute existed on the Web sites (i.e., binary scoring). Manual 
coding of each Web site required 45 min to 1 h. 

To reveal patterns of terrorist Web site existence and degree of a site’s activities, 
in the third analysis task, we performed two types of visualization: multidimen- 
sional scaling and snowflake visualizations. 

Multidimensional scaling visualization provided a high-level picture of all the 
terrorist groups and their relationships. We used multidimensional scaling (MDS) 
to transform a high-dimensional similarity matrix to a set of two-dimensional coor- 
dinates (Young 1987). While other visualization techniques might have been appli- 
cable, we chose MDS because it suits the current data structure and provides a 
vivid picture summarizing terrorist groups’ relationships. Figure 5.2 shows these 
relationships, in which the sites appear as nodes and the lines connect pairs of sites 
that have at least one hyperlink pointing from one site to another. Using the similar- 
ity matrix as input, the MDS algorithm calculated coordinates of each site and 
placed the sites on a two-dimensional space where proximity reflects similarity. 
Upon closer examination of the figure, seven clusters of sites emerge (the numbers
in parentheses refer to the sites in Table 5.1; the URLs listed in parentheses were
filtered out in the second round of filtering but appeared in the collection after
the first round):

1. Hezbollah Cluster (#7, 11, 12, hezbollah.org, and intiqad.org) contains the Web 
site of Hezbollah group (www.hezbollah.org) and its affiliated sites such as 
Hezbollah E-magazine (www.intiqad.org), Hezbollah Support Association (#11), 
and the site of Sayyed Hassan Nasrollah (#12), a major leader of Hezbollah. 

2. Palestinian Cluster (# 4, 5, 6, 9, 13, 14, 15, 36, and h4palestine.com) includes 
militant groups fighting against Israel (e.g., Al-Aqsa Martyrs Brigade, Hamas). 
There are links between sites of the same group (e.g., # 4 and 14) and links 
between sites of different groups (e.g., # 9 and 6). 

3. Al-Qaeda Cluster (# 26, 28, 31, 35, 37, and sahwah.com) includes Salafi groups’ 
supporters’ Web sites that often are linked to each other in their “Other friendly 
Web sites” section. They use their Web sites heavily to propagate their ideology. 
For example, Al-ansar.biz posted a video of the beheading of Nicholas Berg, one 
of the first civilians killed by terrorists (Newman 2004). Alsakifah.org provides 
an online discussion forum. 

4. Caucasian Cluster (# 10, 34, kavkazcenter.com, kavkaz.tv, kavkazcenter.net, 
and kavkazcenter.info) consists of Web sites that link to Chechen rebels and 
provide news updates from Chechen areas. For example, Qoqaz.com has docu- 
mented operations against the Russian military. 

5. Jihad Supporters (# 29, 30, 32, 33, clearguidance.blogspot.com, and ummanews.com)
consists of Web sites providing news and general information on the global
jihad movement. These sites rarely are linked to each other and often play a pro- 
paganda role that targets outsiders. 

6. Hizb-Ut-Tahrir Cluster (# 27, hizb-ut-tahrir.org, expliciet.nl, khilafah.com, and
hilafet.com) contains a nonterrorist political group, Hizb-Ut-Tahrir, dedicated to the
restoration of Islamic law and Khilafah (global leadership of Muslims). It has a 
presence in many Arab countries (e.g., Lebanon, Jordan) and some European 
countries. For instance, Expliciet.nl is a Dutch Web site based in the 
Netherlands. 

7. Tanzeem-e-Islami Cluster (tanzeem.org) consists of a single site representing the 
Pakistani “Tanzeem-e-Islami” party with no clear ties to terrorism. 
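
To make the MDS step more concrete, the following sketch places sites on a two-dimensional plane from a similarity matrix. It is a simplified stand-in, not the MDS algorithm used in this study: it minimizes squared-error stress by plain gradient descent, and the conversion from similarity to target distance, 1 / (1 + similarity), is an assumption introduced only for illustration. The function names and the four-site example are hypothetical.

```python
import math
import random


def similarity_to_dissimilarity(sim):
    """Assumed conversion: higher similarity gives a smaller target distance."""
    return [[0.0 if i == j else 1.0 / (1.0 + s) for j, s in enumerate(row)]
            for i, row in enumerate(sim)]


def mds_2d(dissim, iterations=2000, step=0.2, seed=0):
    """Place n items in 2-D so that distances approximate dissim[i][j].

    Gradient descent on the stress sum over pairs of (distance - target)^2.
    """
    n = len(dissim)
    rng = random.Random(seed)
    pts = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]
    for _ in range(iterations):
        grads = [[0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                dx = pts[i][0] - pts[j][0]
                dy = pts[i][1] - pts[j][1]
                dist = max(math.hypot(dx, dy), 1e-9)  # avoid division by zero
                coeff = 2.0 * (dist - dissim[i][j]) / dist
                grads[i][0] += coeff * dx
                grads[i][1] += coeff * dy
        for i in range(n):
            # Dividing the step by n keeps the update stable as n grows.
            pts[i][0] -= (step / n) * grads[i][0]
            pts[i][1] -= (step / n) * grads[i][1]
    return pts


# Hypothetical similarity matrix: sites 0-1 and sites 2-3 are heavily linked.
sim = [
    [0.0, 3.0, 0.0, 0.0],
    [3.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 3.0],
    [0.0, 0.0, 3.0, 0.0],
]
coords = mds_2d(similarity_to_dissimilarity(sim))
```

In the resulting layout, heavily linked sites should lie close together and unlinked sites farther apart, which is the property that lets clusters such as those listed above become visible.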

Snowflake visualization supports analysis of different dimensions (or categories) 
of activities of a Web site cluster. It originates from a star plot that has been widely 
used to display multivariate data (Chambers et al. 1983). A snowflake (shown in 
Fig. 5.3) represents a terrorist site cluster. Figure 5.3 shows five snowflake dia- 
grams, each representing the degree of activity of terrorist/supporter groups in the 
five terrorist clusters (clusters 1-5) described above (clusters 6 and 7 are not included 
because they do not contain terrorist sites). The six sides of a snowflake represent 
the six dimensions of terrorist use of the Web, as shown in Table 5.2 and explained 
above. Each of these six dimensions represents a normalized scale between 0 and 1
(activity index), showing the degree of activity on the dimensions.

Fig. 5.3 Snowflake visualization of five terrorist site clusters (panel titles include Cluster 1:
Hizballah Cluster; Cluster 2: Palestinian Cluster; Cluster 3: Al-Qaeda Cluster)

The activity index of cluster c on dimension d was calculated by the following formula:



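\[
\text{Activity Index}(c, d) = \frac{\sum_{i=1}^{n} \sum_{j=1}^{m} w_{i,j}}{m \times n}
\]

where

\[
w_{i,j} =
\begin{cases}
1 & \text{if attribute } i \text{ occurs in Web site } j \\
0 & \text{otherwise}
\end{cases}
\]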


n is the total number of attributes in the specified dimension d; 
m is the total number of Web sites belonging to the specified cluster c. 

The closer the activity index is to 1, the more active on that dimension a cluster 
is. This index reveals in what areas the terrorist groups are active, and hence pro- 
vides investigators and analysts with clues about how to devise strategies to combat 
a group. 
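
The activity index can be computed directly from the binary coding of Web site attributes. The sketch below is a minimal illustration under stated assumptions: the dictionary layout, the function name activity_index, and the two example sites are hypothetical, while the attribute names follow the “Sharing ideology” category of Table 5.2.

```python
def activity_index(coding, cluster_sites, dimension_attributes):
    """Fraction of (attribute, site) pairs coded 1 for a cluster and dimension.

    coding[site][attribute] is 1 if the attribute occurs on the site;
    a missing attribute is treated as 0.
    """
    n = len(dimension_attributes)  # number of attributes in dimension d
    m = len(cluster_sites)         # number of Web sites in cluster c
    if n == 0 or m == 0:
        raise ValueError("cluster and dimension must be non-empty")
    total = sum(coding[site].get(attr, 0)
                for site in cluster_sites
                for attr in dimension_attributes)
    return total / (m * n)


# Attributes of the "Sharing ideology" dimension (from Table 5.2).
sharing_ideology = [
    "Mission",
    "Doctrine",
    "Justification of the use of violence",
    "Pinpointing enemies",
]
# Hypothetical binary coding for a two-site cluster.
coding = {
    "site1": {"Mission": 1, "Doctrine": 1},
    "site2": {"Mission": 1},
}
index = activity_index(coding, ["site1", "site2"], sharing_ideology)
# index == 0.375, i.e., 3 occurrences out of 2 sites x 4 attributes
```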



5 Results and Discussion 

Our preliminary observations show that the methodology yielded promising results. 
For example, it identified Web sites affiliated with 10 of the 26 groups classified as 
jihad terrorist organizations in the US State Department report on terrorism. Al-ansar.biz
(# 25), the site that posted the beheading video of Nicholas Berg, posted messages
from al-Qaeda leaders such as Osama bin-Laden, Ayman Al-Zawahiri, and
Al-Zarqawi, praising their attacks on enemies. Another site, Tawhed.com (# 38),
posted a poem praising the 9/11 attacks. The rhetoric of the poem commonly appears
in many al-Qaeda-affiliated Web sites, referring to Americans as crusaders.
Words like Sunna and Jama’h reflect the branch of Islam to which the Salafi
groups belong.

From the snowflake diagrams (Fig. 5.3), we found that terrorists and supporters 
use the Web heavily to share ideology and to propagate ideas, especially to their 
members. For example, the Palestinian cluster (cluster 2) actively shares its ideology 
and heavily uses the Web as a propaganda tool for members. The Web sites in this 
cluster support liberation of Palestine, pinpoint and criticize their enemies, and 
describe details of operations and rationales supported by Quran verses. In contrast,
jihad supporters (cluster 5) rarely use the Web for propaganda but share ideology 
and communicate there. The Hezbollah cluster (cluster 1) resembles the Palestinian 
cluster in heavy use of the Web for sharing ideology and insider propaganda. For 
example, the sites in this cluster glorify martyrs and leaders and also were used mod- 
erately for outsider propaganda and communications. In the five clusters, we found 
little evidence of using the Web for fundraising or building a virtual community. 
Probably such uses have gone underground or do not appear on the Web. 



5.1 Expert Evaluation and Results 

Based on the above results, we invited a terrorism expert to conduct an evaluation of 
the methodology. A senior fellow of the US Institute of Peace at Washington, D.C., 
the expert is a professor of communication in a major research university in Israel. 
Having expertise in modern terrorism and the Internet, he has published more than 80 
refereed journal articles and books and is a frequent speaker at international confer- 
ences on counterterrorism. This expert also leads a team of about 16 research assis- 
tants who regularly monitor 4,300 sites on the Dark Web for terrorist activities. The 
approach he and his team use to collect and analyze terrorists’ use of the Web is
largely manual, relying on laborious human browsing and monitoring of selected Web 
sites. His experience in manual analysis served to contrast with our methodology that 
automated part of the Dark Web data collection and analysis. We decided to use expert 
validation instead of other evaluation methods for two reasons: (1) a lab experiment is
not suitable because typical experimental subjects do not have much knowledge of the
Dark Web, and (2) it is not feasible to invite terrorists to participate in an interview or
empirical evaluation. The expert was not involved in writing this paper. 
The evaluation was conducted using an unbiased structured questionnaire and a 
formal procedure. We showed the results to our expert and asked him to provide 
detailed comments on the categorization of Web sites and attributes, the visualiza- 
tion and clustering of terrorist groups, and the usability of the snowflake visualiza- 
tion. In general, he deemed the results to be very promising and the methodology 
design to be excellent. He believed that this was the start of very important research 
that will result in a useful database and a reliable methodology to update and main- 
tain the database. 

The expert was greatly impressed by the visualization and clustering capabilities 
of the methodology and provided valuable comments on our work. However, he 
said that the 39 Web sites shown in Table 5.1 do not represent the entire population 
of all terrorist Web sites, the number of which he estimated to be over 4,000. Because 
we focused only on Middle Eastern terrorist groups (rather than all terrorist groups 
in the world), we believe that our methodology has yielded representative results 
and has automated much of the manual work of identifying and analyzing terrorist 
Web sites. He suggested adding qualitative measures, such as persuasive appeals, 
rhetoric, and attribution of guilt, to the Web site attributes shown in Table 5.2. We 
believe that these important attributes are difficult to incorporate into the automated 
processing of our methodology because of their qualitative nature. He considered 
the clustering and visualization shown in Fig. 5.2 to be very important because of its 
usefulness to investigation of terrorist activities on the Web. He called the snowflake 
visualization very accurate and very useful for investigation of terrorist Web sites 
but criticized the way we created linkages among Web sites. He suggested consider- 
ing textual citations and other references in addition to using only hyperlinks. 

Overall, the expert agreed that the results were very promising because they offer 
useful investigation leads and would be very helpful to improve understanding 
of terrorist activities on the Web. Because of the high qualification and relevant 
experience of this expert, we believe that the evaluation results accurately reflect the 
effectiveness of the methodology. These results also contributed to advancing the 
ISI discipline by showing the applicability of the methodology to Dark Web data 
collection and analysis. 



6 Conclusions and Future Directions 

Collecting and analyzing Dark Web information has challenged investigators and 
researchers because terrorists can easily hide their identities and remove traces of 
their activities on the Web. The abundance of Web information has made it difficult 
to obtain a comprehensive picture of terrorists’ activities. In this chapter, we proposed 
a methodology to address these problems. Using advanced Web mining, content anal- 
ysis, visualization techniques, and human domain knowledge, the methodology 
exploited various information sources to identify and analyze 39 jihad Web sites. 
Information visualization was used to help identify terrorist clusters and to under- 
stand terrorist use of the Web. Our expert evaluation showed that the methodology 
yielded promising results that would be very useful to assist investigation of terrorism. 
The expert considered the visualization results very useful, having potential to guide 
policy making and intelligence research. Therefore, this research has contributed to 
developing a useful methodology for collecting and analyzing Dark Web information, 
applying the methodology to study and analyze 39 jihad Web sites, and providing 
formal evaluation results of the usability of the methodology. 

We are pursuing a number of directions to further our research. As terrorists 
often change their Web sites to remove traces of their activities, we plan to archive 
the Dark Web content digitally and to apply our methodology to tracing terrorist 
activities over time. We will develop scalable techniques to collect such volatile yet 
valuable content, to visualize large volumes of Dark Web data, and to extract mean- 
ingful entities from terrorist Web sites. These efforts will help investigators trace 
and prevent terrorist attacks. 
Chapter 6 

Dark Network Analysis 



1 Introduction 

In recent years, scientists have revealed the topological properties of a wide variety 
of complex systems characterized as large-scale networks, such as scientific col- 
laboration networks, the World Wide Web, the Internet, electric power grids, and 
biological networks, among many others. Despite the enormous variation in their 
components, functions, and sizes, these networks are surprisingly similar in topol- 
ogy, leading to the conjecture that complex systems are governed by the ubiquitous 
self-organizing principle. 

One missing piece in this picture, however, is the analysis of the topology of
“dark” networks that are hidden from view yet could have a devastating impact on 
our society and economy. Terrorist networks, drug-trafficking rings, arms smug- 
gling networks, gang networks, and many other covert networks are all dark net- 
works. The structure of dark networks is largely unknown due to the difficulty of 
collecting and accessing reliable data. Do dark networks share the same topological 
properties with other types of empirical networks? Do they follow the same organiz- 
ing principle? How do they achieve efficiency under constant surveillance and 
threats from authorities? How robust are they against attacks? In this chapter, we 
report the topological properties of several covert criminal- or terrorist-related net- 
works. We hope not only to contribute to the general understanding of structural 
properties of complex systems in a hostile environment but also to provide authori- 
ties with insights regarding disruptive strategies. 



2 Topological Analysis of Networks 

Topological analysis, which focuses on the statistical characteristics of network
structure, is a new methodology for studying large-scale networks (Albert and Barabasi
2002; Watts and Strogatz 1998). Large complex networks can be categorized into
three types: random, small-world, and scale-free. A number of statistics have been
developed to study the topology of networks. Table 6.1 lists several of these statistics,
among which average path length, average clustering coefficient, and degree
distribution have been widely used to categorize networks into different types.

Table 6.1 The statistics for studying network topology

Statistic                          Description
Average path length, l             The average of the lengths of the shortest paths between all
                                   pairs of nodes in a network
Average clustering coefficient, C  The average of all individual clustering coefficients, C_i,
                                   which is the number of links that actually exist among node
                                   i's neighbors over the possible number of links among these
                                   neighbors
Average degree, <k>                The average of all individual degrees, k_i, which is the
                                   number of links node i has
Degree distribution, p(k)          The probability that an arbitrary node has exactly k links
Link density, d                    The number of links that actually exist over the possible
                                   number of links in a network
Assortativity, r                   The Pearson correlation between the degrees of two adjacent
                                   nodes
Global efficiency, e               The average of the inverses of the lengths of the shortest
                                   paths over all pairs of nodes in a network
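
As a minimal sketch, the statistics listed in Table 6.1 can be computed for a small undirected graph with the Python standard library alone. The function names (`build_graph`, `topology_statistics`) and the toy edge list are illustrative assumptions, not part of the original study; the average path length is only meaningful here if the graph is connected, which is why the chapter later restricts the analysis to the giant component.

```python
from collections import Counter, deque
from itertools import combinations


def build_graph(edges):
    """Build an undirected adjacency map {node: set of neighbors}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def bfs_distances(adj, source):
    """Shortest-path lengths (in links) from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def local_clustering(adj, i):
    """C_i: links among node i's neighbors over the possible number of such links."""
    k = len(adj[i])
    if k < 2:
        return 0.0
    existing = sum(1 for a, b in combinations(adj[i], 2) if b in adj[a])
    return existing / (k * (k - 1) / 2)


def topology_statistics(adj):
    """Statistics of Table 6.1 for a connected, undirected graph (assumed)."""
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2
    possible_links = n * (n - 1) / 2
    total_length = 0
    total_inverse = 0.0
    for source in adj:
        for target, d in bfs_distances(adj, source).items():
            if target != source:
                total_length += d
                total_inverse += 1.0 / d
    degree_counts = Counter(len(nbrs) for nbrs in adj.values())
    # Each unordered pair was visited twice above, once from each end.
    return {
        "average_path_length": total_length / 2 / possible_links,
        "average_clustering": sum(local_clustering(adj, i) for i in adj) / n,
        "average_degree": 2 * m / n,
        "degree_distribution": {k: c / n for k, c in degree_counts.items()},
        "link_density": m / possible_links,
        "global_efficiency": total_inverse / 2 / possible_links,
    }


# Toy example: a triangle A-B-C with one extra node D attached to C.
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
stats = topology_statistics(build_graph(edges))
```

In this toy graph the six node pairs have path lengths 1, 1, 2, 1, 2, and 1, so the average path length is 8/6 and the global efficiency is 5/6.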

In random networks, two arbitrary nodes are connected with a probability p, and
as a result, each node has roughly the same number of links. Random networks are
characterized by small l, small C, and bell-shaped Poisson degree distributions (Albert
and Barabasi 2002). A small l means that an arbitrary node can reach any other node in
a few steps. A small C implies that random networks are not likely to contain clusters
and groups. Studies have found that most complex systems are not random but
present small-world and scale-free properties.

The small-world and scale-free models are different from the random graph
model. A small-world network has a significantly larger C than its random network
counterpart while maintaining a relatively small l (Watts and Strogatz 1998).
Scale-free networks, on the other hand, are characterized by a power-law degree
distribution, meaning that while a large percentage of nodes in the network have just a
few links, a small percentage of the nodes have a large number of links (Albert and
Barabasi 2002). It is believed that scale-free networks evolve following the
self-organizing principle, where growth and preferential attachment play a key role in
the emergence of the power-law distribution. In particular, preferential attachment
implies that the more links a node has, the more new links it can attract, manifesting
the “rich-get-richer” phenomenon.

Analysis of the topology of complex systems has important implications for
our understanding of nature and society. Research has shown that the function of
a complex system may be affected to a great extent by its network topology.
For instance, the small average path length of the World Wide Web makes cyberspace
a very convenient, strongly navigable system, in which any two web pages
are on average only 19 clicks away from each other. It has also been shown that the
higher tendency for clustering in metabolic networks corresponds to the organization
of functional modules in cells, which contributes to the behavior and survival
of organisms. In addition, networks with scale-free properties are highly robust
against random failures and errors but quite vulnerable under targeted attacks
(Holme et al. 2002).



3 Methods and Data 

To understand the topology and function of dark networks, we studied four terrorist- 
and criminal-related networks: 

1. The global Salafi jihad (GSJ) terrorist network (Sageman 2004) (see Fig. 6.1), 
which consists of 366 members including those from Osama bin-Laden’s al- 
Qaeda. These terrorists were connected by kinship, friendship, religious ties, and 
relations formed after they joined the GSJ network. 

The terrorists belong to one of four groups: al-Qaeda or Central Staff (pink), 
Core Arabs (yellow), Maghreb Arabs (blue), and Southeast Asians (green). Each 
circle represents one or more terrorist activities, such as the 9/11 attacks and Bali
bombing, which are noted. 

Fig. 6.1 The giant component in the GSJ network (Data courtesy of Marc Sageman)

The GSJ data were provided by the author of a recently published book,
“Understanding Terror Networks” (Sageman 2004). The network was constructed
based entirely on open source data including the documents and transcripts
of court proceedings, press and scholarly articles, and web articles of the
past few decades. Information about all the nodes (terrorists) and links (relations)
was scrutinized and carefully cross-validated. However, as the author
pointed out in the book, the data are subject to several limitations. First, the
members in the network may not be a representative sample of the global Salafi
jihad as a whole. The sample is biased toward leaders and the members who have been
captured or uncovered in executed attacks. Second, because most of the sources
were based on retrospective accounts, the data may be subject to self-reported
biases. Despite the limitations, the data have revealed stunning insights into the
clandestine organizations of terrorists. Interested readers may refer to the book
for more details about these terrorist organizations.

2. A narcotics-trafficking criminal network (“Meth World”) whose members mainly 
dealt with methamphetamines (Xu and Chen 2003). The network consists of 
1,349 criminals, who have been traced and examined by the Tucson Police 
Department since 1985. Because no information about the social relationships 
among these members was directly available, we retrieved from the police data- 
bases all the crime incidents in which these criminals were involved from 1985 
to 2002. A link was created between two criminals if they committed at least one 
crime together. 

Although this network had been carefully validated by the crime analysts from 
the police department (Xu and Chen 2003), the co-occurrence links generated 
based on crime incidents may not reflect the real relationships between criminals. 
Two related criminals will appear to be unconnected if they never commit a crime 
together. On the other hand, a coincidental link may connect two criminals together 
if they happen to appear in the same incident. These two problems are also com- 
mon to other types of networks such as the movie actor networks that are con- 
structed based on the co-occurrence of two nodes in the same events or activities. 

3. A gang criminal network consisting of 3,917 criminals who were involved in 
gang-related crimes in Tucson between 1985 and 2002 (Xu and Chen 2003). As 
in the Meth World, the links in this network were generated using co-occurrence 
analysis of the crime incidents. 

4. A terrorist web site network (“Dark Web”) collected based on reliable govern- 
mental sources. We identified 104 web sites created by four major international 
terrorist groups, namely, Al-Gama’a al-Islamiyya, Hezbollah, Al-Jihad, and 
Palestinian Islamic Jihad and their supporters. All pages from these web sites 
were fetched, and the hyperlinks were extracted. We created a link between two 
web sites if at least one hyperlink existed between any two web pages in them. 
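
The co-occurrence rule used for the Meth World and gang networks (items 2 and 3 above) can be sketched as follows. The incident records, person identifiers, and the function name `cooccurrence_links` are hypothetical; the sketch only assumes that each incident lists the people involved, and it keeps a count per pair so that links supported by a single incident (possibly coincidental) can be told apart from repeated ones.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_links(incidents):
    """Map each unordered pair of people to the number of shared incidents.

    incidents: {incident_id: iterable of person ids} (hypothetical format).
    A pair appears in the result if the two people share at least one incident.
    """
    weights = Counter()
    for people in incidents.values():
        # Deduplicate and sort so that each pair has one canonical form.
        for a, b in combinations(sorted(set(people)), 2):
            weights[(a, b)] += 1
    return weights


incidents = {
    "case-1": ["p1", "p2", "p3"],
    "case-2": ["p2", "p3"],
    "case-3": ["p4"],  # a lone offender creates no link
}
links = cooccurrence_links(incidents)
```

The same pattern would apply to the Dark Web network if each "incident" were replaced by a hyperlink between two pages and each "person" by the web site hosting the page, although that mapping is an assumption about how the site-level links were aggregated.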



4 Results and Discussion

4.1 Basic Properties

Table 6.2 presents the basic statistics of the four elicited networks under study. Like
many other empirical networks, each network contains many small components
and a single giant component. The giant component in a graph is defined as the
largest connected subgraph. The 356 terrorists in the giant component of the GSJ
network are separated from the remaining ten because no valid evidence has been
found to connect those ten to the giant component. The giant
components in the Meth World and gang networks contain only 68.5% and 57.0%
of the nodes, respectively. This may be because the data were collected from a
single law enforcement jurisdiction, which may not have complete information
about all relations between criminals, causing missing links between the giant
component and other smaller components. The isolated components in the Dark Web
are possibly due to the differences in the terrorist groups’ ideologies.

Table 6.2 The basic statistics and scale-free properties of the dark networks

                                      GSJ            Meth World     Gang network    Dark Web
Number of nodes, n                    366            1,349          3,917           104
Number of links, m                    1,247          4,784          9,051           156
Size of giant component               356 (97.3%)    924 (68.5%)    2,231 (57.0%)   80 (77.9%)
Average degree, <k>                   6.97           4.62           5.74            3.88
Maximum degree                        44 (12.4%)     37 (4.0%)      51 (2.3%)       33 (41.3%)
Link density, d                       0.02           0.01           0.003           0.05
Assortativity, r                      0.41**         -0.14**        0.17**          -0.24*
Power-law distribution exponent, γ    1.38           1.86           1.95            1.10
Goodness of fit, R²                   0.74           0.89           0.81            0.82

The numbers in parentheses in the third row are the percentages of total nodes included in the giant
components. The numbers in parentheses in the fifth row are the percentages of the total nodes that
are connected with the highest-degree nodes
**p-value < 0.05
*p-value < 0.01

Similar to many other network topology studies (e.g., Barabasi et al. 2002), we 
performed topological analysis only on the giant component in these networks. In 
Table 6.2, we report the average degrees and maximum degrees of the four net- 
works. It can be seen that some terrorists in the GSJ network and some terrorist web 
sites in the Dark Web are extremely popular, connecting to more than 10% of the 
nodes in the networks. 
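
Restricting the analysis to the giant component can be sketched with a breadth-first search over an adjacency map. The helper names and the toy network below are assumptions for illustration; the share of nodes in the giant component corresponds to the percentages reported in the third row of Table 6.2.

```python
from collections import deque


def connected_components(adj):
    """Return the connected components of an undirected graph, largest first."""
    seen = set()
    components = []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        component = {start}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    component.add(v)
                    queue.append(v)
        components.append(component)
    return sorted(components, key=len, reverse=True)


def giant_component(adj):
    """Adjacency map restricted to the largest connected component."""
    giant = connected_components(adj)[0]
    return {u: adj[u] & giant for u in giant}


# Toy network: a chain 1-2-3, a separate pair 4-5, and an isolated node 6.
adj = {1: {2}, 2: {1, 3}, 3: {2}, 4: {5}, 5: {4}, 6: set()}
giant = giant_component(adj)
share = len(giant) / len(adj)  # fraction of all nodes in the giant component
```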

The assortativity indicates the tendency for nodes to connect with others who are 
similarly popular in terms of degree. The assortativity coefficients of these four net- 
works are all significantly different from 0. The GSJ and the gang networks present 
positive assortativity, meaning that popular members tend to connect with other 
popular members. In positively assortative networks, high-degree nodes tend to clus- 
ter together as core groups (Newman 2003). This phenomenon is especially evident 
in the GSJ network in which bin-Laden and his sergeants form the core of the net- 
work and issue commands to other parts of the network (Sageman 2004). The Meth 
World and Dark Web, in contrast, have negative assortativity coefficients - disas- 
sortativity. The Meth World consists of drug dealers who sold drugs to many indi- 
vidual buyers; the buyers did not connect with many other buyers or dealers. Further, 
studies have found that street drug-dealing organizations are led by a few high-level 
individuals, who connect with a large number of low-level street drug dealers (Levitt 
and Dubner 2005). Because high-degree nodes connect with low-degree nodes, the 
Meth World presents disassortative mixing patterns. The disassortativity in the Dark
Web, on the other hand, is due to the fact that the popular Dark Web sites received 
many inbound hyperlinks from less popular web sites. 
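
The assortativity coefficient described in Table 6.1, the Pearson correlation between the degrees at the two ends of a link, can be sketched as below. Counting every link once in each direction is an assumption that makes the measure symmetric for undirected graphs; the significance tests reported in Table 6.2 are not reproduced here.

```python
def degree_assortativity(edges):
    """Pearson correlation between the degrees of the two ends of each link."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # Use each undirected link in both directions so x and y share one distribution.
    xs, ys = [], []
    for u, v in edges:
        xs += [degree[u], degree[v]]
        ys += [degree[v], degree[u]]
    mean = sum(xs) / len(xs)
    covariance = sum((x - mean) * (y - mean) for x, y in zip(xs, ys))
    variance = sum((x - mean) ** 2 for x in xs)
    # The variance is zero when every node has the same degree; r is undefined then.
    return covariance / variance


# A star is maximally disassortative: the hub links only to degree-1 nodes.
star = [(0, 1), (0, 2), (0, 3), (0, 4)]
r_star = degree_assortativity(star)

# A four-node path mixes degree-1 ends with degree-2 middle nodes.
path = [("a", "b"), ("b", "c"), ("c", "d")]
r_path = degree_assortativity(path)
```

A dealer who sells to many buyers who know no one else resembles the star, which is consistent with the negative coefficient reported for the Meth World.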



4.2 Small-World Properties 

To ascertain if the dark networks are small worlds, we calculated their average path 
lengths, clustering coefficients, and global efficiency (see Table 6.3). For each net- 
work, we generated 30 random counterparts that had the same number of nodes and 
the same number of links as in the corresponding elicited networks. We found that 
all these networks have significantly high clustering coefficients compared with 
their random counterparts. In addition, although the differences are statistically sig- 
nificant (greater than three standard deviations), the average path lengths of these 
networks (except for the gang network) are just slightly greater than their random 
counterparts. 
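
The comparison against random counterparts can be sketched as follows. The text specifies only that each counterpart has the same number of nodes and links; drawing the links uniformly at random from all node pairs is our assumption, as are the function names and the ring-of-cliques test network, which stands in for a clustered network and is not one of the four data sets.

```python
import random
from itertools import combinations
from statistics import mean, stdev


def average_clustering(adj):
    """Average of the local clustering coefficients over all nodes."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k >= 2:
            linked = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
            total += linked / (k * (k - 1) / 2)
    return total / len(adj)


def random_counterpart(nodes, m, rng):
    """Random graph with the given nodes and exactly m uniformly chosen links."""
    adj = {u: set() for u in nodes}
    for u, v in rng.sample(list(combinations(nodes, 2)), m):
        adj[u].add(v)
        adj[v].add(u)
    return adj


def compare_with_random(adj, trials=30, seed=0):
    """Observed clustering plus mean and standard deviation over random counterparts."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2
    samples = [average_clustering(random_counterpart(nodes, m, rng))
               for _ in range(trials)]
    return average_clustering(adj), mean(samples), stdev(samples)


def ring_of_cliques(cliques=6, size=5):
    """Hypothetical clustered network: cliques joined into a ring by single links."""
    adj = {}
    for c in range(cliques):
        members = range(c * size, (c + 1) * size)
        for u in members:
            adj[u] = set(members) - {u}
    for c in range(cliques):
        u = c * size + size - 1           # last member of clique c
        v = ((c + 1) % cliques) * size    # first member of the next clique
        adj[u].add(v)
        adj[v].add(u)
    return adj


observed, mu, sigma = compare_with_random(ring_of_cliques())
significantly_clustered = observed > mu + 3 * sigma
```

The three-standard-deviation threshold mirrors the significance criterion mentioned in the text for path lengths; applying it to clustering coefficients as well is an illustrative choice.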

These small-world properties imply that a terrorist or criminal can connect with 
any other member in a network through only a few mediators. In addition, these 
networks are quite sparse with very low link density. These properties have impor- 
tant implications for the communication efficiency of the covert networks. Because 
the risk of being detected by authorities increases as more people are involved, the 
small path length and link sparseness can help lower risks and enhance efficiency. 
As a result, the global efficiency of each network is comparable with that of its
random network counterpart.

On the other hand, a high clustering coefficient contributes to the local efficiency 
of these dark networks. Previous studies have also shown evidence of groups and 
teams in these networks (Sageman 2004; Xu and Chen 2003). In these groups and 
teams, members tend to have denser and stronger relations with one another. The 
communication between group members becomes more efficient, therefore making 
a crime or an attack easier to plan, organize, and execute. 

We also calculated the path length of other nodes to central nodes. We found that 
members in the criminal and terrorist networks are extremely close to their leaders. 
The terrorists in the GSJ network are on average only 2.5 steps away from bin- 
Laden, meaning that bin-Laden’s command can reach an arbitrary member through 
only two mediators. Similarly, the average path length to the leader in the Meth 
World (Xu and Chen 2003) is only 3.9. Such a short chain of command also means 
communication efficiency. 
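
The average path length from ordinary members to a central node is a single breadth-first search. The sketch below uses a hypothetical four-node network and the illustrative name `average_distance_to`; it assumes every member can reach the leader, as is the case inside a giant component.

```python
from collections import deque


def average_distance_to(adj, leader):
    """Average shortest-path length from every other reachable node to leader."""
    dist = {leader: 0}
    queue = deque([leader])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    others = [d for node, d in dist.items() if node != leader]
    return sum(others) / len(others)


# Hypothetical chain of command: the leader knows a and c directly; b only via a.
adj = {"leader": {"a", "c"}, "a": {"leader", "b"}, "b": {"a"}, "c": {"leader"}}
avg = average_distance_to(adj, "leader")
```

An average distance of d steps corresponds to d - 1 intermediaries on a typical path, which is how a distance of about 2.5 is read in the text.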

Special attention should be paid to the Dark Web. Despite the small size of its 
giant component (80), the average path length is 4.70, slightly larger than that (4.20)
of the GSJ network, whose giant component (356) has more than four times as many nodes. Since hyperlinks help
visitors navigate between web pages, and because terrorist web sites are often used 
for soliciting new members and donations, the relatively big path length may be due 
to the reluctance of terrorist groups to share potential resources with other terrorist 
groups. 

Fig. 6.2 The cumulative degree distributions of (a) the GSJ network, (b) the Meth World, (c) the
gang network, and (d) the Dark Web



4.3 Scale-Free Properties 

Moreover, these dark networks present scale-free properties with power-law degree
distributions, p(k) ~ k^(-γ). Because degree distribution curves often fluctuate a lot, we
display the cumulative degree distributions, P(k), in a log-log plot (see Fig. 6.2).
P(k) is defined as the probability that an arbitrary node has at least k links.
Figure 6.2 also presents the fitted power-law distributions. The last two rows of
Table 6.2 report the exponent value, γ, and the goodness of fit, R², for each network.
It is evident in Fig. 6.2 that all these networks are scale-free networks. The
power-law distributions fit especially well at the tails. Note that each of the three
human networks displays a two-regime scaling behavior, which has also been
observed in some other empirical networks such as scientific collaboration networks.
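
A minimal sketch of the cumulative degree distribution and a log-log least-squares fit is given below. The chapter does not state how its exponents were fitted, so the ordinary least-squares line on the cumulative curve is an assumption, and the function names are illustrative. Note that if p(k) ~ k^(-γ) held exactly, the cumulative distribution P(k) would decay with exponent γ - 1; the function simply reports the slope of whatever curve it is given.

```python
from collections import Counter
from math import log


def cumulative_degree_distribution(degrees):
    """P(k): fraction of nodes whose degree is at least k, for each observed k."""
    n = len(degrees)
    counts = Counter(degrees)
    cumulative = {}
    running = 0
    for k in sorted(counts, reverse=True):
        running += counts[k]
        cumulative[k] = running / n
    return dict(sorted(cumulative.items()))


def fit_power_law(distribution):
    """Least-squares line through (log k, log P); returns (exponent, R squared)."""
    points = [(log(k), log(p)) for k, p in distribution.items() if k > 0 and p > 0]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    r_squared = sxy * sxy / (sxx * syy)
    return -slope, r_squared


# Small worked example: eight nodes with degrees 1, 1, 1, 1, 2, 2, 3, 4.
cdf = cumulative_degree_distribution([1, 1, 1, 1, 2, 2, 3, 4])

# Sanity check on an exact power law with exponent 1.5.
exact = {k: k ** -1.5 for k in (1, 2, 4, 8)}
exponent, r_squared = fit_power_law(exact)
```

A two-regime distribution, as described for the human networks, would show up as two straight segments with different slopes, so a single fitted line would yield a lower R².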

Two mechanisms have been proposed to account for the emergence of two- 
regime power-law degree distributions during the evolution of a network (Barabasi 
et al. 2002). First, new links may emerge between existing network members. This 
implies that criminals or terrorists who were not related before could become con- 
nected as time progressed. This is a rather realistic assumption since two previously 
unacquainted members could become acquaintances through the introduction of a 
third member who knows both of them. In the GSJ network, 22.6% of the links were 
postjoining ties which were formed between existing members. Second, an existing 
link may be rewired, which is a strong possibility in the GSJ and Dark Web. However, 
this would not affect the Meth World and the gang network, because a co-occurrence 
link could not be rewired once it was created. 

An interesting question follows: what mechanisms have played a role in producing
the observed properties of the dark networks, namely, small average path length, high 
clustering coefficient, and power-law degree distributions with two-regime scaling 
behavior in the human networks? In other words, can we regenerate the dark networks 
based on known mechanisms such as growth and preferential attachment? To answer 
this question, we conducted a series of simulations in which 30 networks were gener- 
ated for each elicited human network based on three mechanisms: 

(a) Growth. Starting with a small number of nodes, at each time step, we add a new 
node to connect with existing nodes in the network. 

(b) Preferential attachment. The probability that an existing node will receive a 
link from the new node depends on the number of links the node has already 
maintained. The more links it has the more likely that it will receive the link. 

(c) New links between existing nodes. At each time step, a random pair of existing 
nodes may get connected depending on the number of common neighbors they 
have. The more common neighbors they share the more likely it is that they will 
be connected. 
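
The three mechanisms can be combined in a small simulation such as the sketch below. The parameters (`links_per_node`, `internal_candidates`), the fully connected seed network, and the rule of adding at most one internal link per time step are all assumptions, since the chapter does not report the settings used in its simulations.

```python
import random


def simulate_growth(n, links_per_node=2, internal_candidates=10, seed=0):
    """Grow a network by (a) growth, (b) preferential attachment, and
    (c) new links between existing nodes that share common neighbors."""
    rng = random.Random(seed)
    core = links_per_node + 1
    adj = {i: set(range(core)) - {i} for i in range(core)}  # small seed network
    # Each node appears in 'stubs' once per link end, so a uniform draw from
    # this list selects a node with probability proportional to its degree.
    stubs = [u for u in adj for _ in adj[u]]
    for new in range(core, n):
        # (a) + (b): the new node attaches to degree-weighted distinct targets.
        targets = set()
        while len(targets) < links_per_node:
            targets.add(rng.choice(stubs))
        adj[new] = set()
        for t in targets:
            adj[new].add(t)
            adj[t].add(new)
            stubs += [new, t]
        # (c): sample some unconnected pairs and link one of them, chosen with
        # probability proportional to the number of common neighbors.
        nodes = list(adj)
        pairs, weights = [], []
        for _ in range(internal_candidates):
            u, v = rng.sample(nodes, 2)
            common = len(adj[u] & adj[v])
            if v not in adj[u] and common > 0:
                pairs.append((u, v))
                weights.append(common)
        if pairs:
            u, v = rng.choices(pairs, weights=weights)[0]
            adj[u].add(v)
            adj[v].add(u)
            stubs += [u, v]
    return adj


network = simulate_growth(200)
link_count = sum(len(nbrs) for nbrs in network.values()) // 2
max_degree = max(len(nbrs) for nbrs in network.values())
```

Because mechanism (c) closes triangles, it raises the clustering coefficient relative to growth and preferential attachment alone, which is the effect the simulations in the text were designed to test.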

Mechanisms (a) and (b) were expected to generate the power-law degree distri- 
bution, and mechanism (c) was expected to generate the high clustering coefficient 
and two-regime scaling behavior. Through the simulations, we found that the power-law
degree distributions could be easily regenerated with R² ranging from 0.83
(the gang network) to 0.88 (GSJ). The two-regime scaling behavior was also present 
in the simulated networks for the human networks. However, the highest clustering 
coefficient from simulation was only 0.24 (GSJ), which was far less than what was 
obtained from the elicited networks (0.55-0.68). This finding implies that there 
must have been some other mechanisms that contributed to the substantially high 
clustering coefficients observed in the dark networks. We suspect that such a mech- 
anism is member recruitment. Because of active recruitment, subgroups of terrorists 
or criminals could attract new members into their groups. The new members quickly 
become acquainted with many existing members of the groups, substantially 
increasing the clustering coefficients. 



4.4 Caveats 

Two problems may have affected the three elicited human networks. First, the
elicited networks may have missing links that can cause the networks to appear
less efficient than they are, because there may actually be
hidden “shortcuts” connecting distant parts of the networks. Second, the presence of
coincidental “fake” links can cause the elicited networks to appear more efficient than
they actually are, since these links are not communication channels.

To test how the results would be affected by missing links, we added various
percentages of the existing links to the elicited networks based on three effects that
had been used in missing link prediction research (Liben-Nowell and Kleinberg 2003):

(a) Random effect. A link is added between a randomly selected pair of nodes 
which are not originally connected. 

(b) Common neighbor effect. A link is added between a pair of unconnected nodes 
if they share common neighbors. The more common neighbors they share, the 
more likely it is that they will be connected. 

(c) Preferential attachment effect. The probability that a pair of unconnected nodes 
will be linked together depends on the product of their degrees. 
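
The three effects above can be sketched as scoring rules for unconnected node pairs, followed by weighted sampling. The function names, the choice to score pairs once on the original network, and the exhaustive enumeration of pairs (quadratic in the number of nodes, so suitable only for small examples) are assumptions of this sketch.

```python
import random
from itertools import combinations


def pair_score(adj, u, v, effect):
    """Relative likelihood that the unconnected pair (u, v) hides a missing link."""
    if effect == "random":
        return 1.0
    if effect == "common_neighbor":
        return float(len(adj[u] & adj[v]))
    if effect == "preferential_attachment":
        return float(len(adj[u]) * len(adj[v]))  # product of the two degrees
    raise ValueError(f"unknown effect: {effect}")


def add_missing_links(adj, fraction, effect, seed=0):
    """Return a copy of adj with round(fraction * m) extra links added."""
    rng = random.Random(seed)
    result = {u: set(nbrs) for u, nbrs in adj.items()}
    m = sum(len(nbrs) for nbrs in adj.values()) // 2
    to_add = round(fraction * m)
    candidates = [(u, v) for u, v in combinations(sorted(adj), 2)
                  if v not in adj[u]]
    weights = [pair_score(adj, u, v, effect) for u, v in candidates]
    added = 0
    while added < to_add and any(w > 0 for w in weights):
        index = rng.choices(range(len(candidates)), weights=weights)[0]
        u, v = candidates[index]
        result[u].add(v)
        result[v].add(u)
        weights[index] = 0.0  # the same pair is never added twice
        added += 1
    return result


# Path 0-1-2-3-4: only (0,2), (1,3), and (2,4) share a common neighbor.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
augmented = add_missing_links(path, 0.5, "common_neighbor")
new_links = {(u, v) for u in augmented for v in augmented[u]
             if u < v and v not in path[u]}
```

After augmentation, the statistics of interest (path length, clustering, degree distribution) would be recomputed on the returned network and compared with the original values.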

We found that the small-world and scale-free properties of these networks do not
change when missing links are added to the networks. For example, when as many
as 10% of the links were added, the average path lengths ranged from 3.55 (GSJ,
preferential attachment links added) to 9.45 (the gang network, common neighbor
links added), the clustering coefficients ranged from 0.45 (GSJ, random links added)
to 0.67 (the gang network, common neighbor links added), and the R² of power-law
degree distributions ranged from 0.61 (GSJ, random links added) to 0.93 (the gang
network, preferential attachment links added).

We also randomly removed certain percentages of links to test the impact of
“fake” links on the results. We found that the results were still valid even when as
many as 10% of the links were removed.



4.5 Network Robustness 

Research has found that network topology has a great impact on the network’s 
robustness against failures and attacks and that scale-free networks are quite robust 
against failures (random removal of nodes) (Holme et al. 2002). Because dark net- 
works have been shown to have scale-free properties, we tested the dark networks’ 
robustness against only targeted attacks. 

We simulated two types of attacks represented by node removal: attacks target- 
ing the hubs and attacks targeting the bridges. While hubs are nodes that have many 
links (high degree), bridges are nodes through which many shortest paths pass (high 
betweenness (Wasserman and Faust 1994)). When simulating the attacks, we distin- 
guished between two attack strategies: simultaneous removal of a fraction of nodes 
based on a measure (degree or betweenness) without updating the measure after 
each removal and progressive removal of nodes with the measure being updated 
after each removal. 

We plotted the changes in S (the fraction of the nodes in the largest component), 
<s> (the average size of remaining components), and average path length after a
fraction of nodes are removed. We found that progressive attacks are more devastat- 
ing than simultaneous attacks. The progressive attacks are similar to “cascading 
failures” in the Internet where an initial failure can cause a series of failures because 
unbearably high traffic is redirected to the next bridge node. 
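
The attack simulations can be sketched as below. Betweenness is computed with Brandes' algorithm for unweighted graphs, which is our choice of method rather than something stated in the text; S is measured relative to the original number of nodes, also an assumption. The test network (two five-node cliques joined through a single low-degree bridge node) is hypothetical.

```python
from collections import deque


def degree(adj):
    return {v: float(len(nbrs)) for v, nbrs in adj.items()}


def betweenness(adj):
    """Brandes' algorithm, unweighted and unnormalized (each pair counted twice)."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        order, preds = [], {v: [] for v in adj}
        sigma = {v: 0 for v in adj}   # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        delta = {v: 0.0 for v in adj}
        for v in reversed(order):     # accumulate dependencies, farthest first
            for u in preds[v]:
                delta[u] += sigma[u] / sigma[v] * (1 + delta[v])
            if v != s:
                score[v] += delta[v]
    return score


def largest_component_fraction(adj, original_size):
    """S: size of the largest remaining component over the original node count."""
    seen, best = set(), 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        size, queue = 0, deque([start])
        while queue:
            u = queue.popleft()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        best = max(best, size)
    return best / original_size


def attack(adj, measure, fraction, progressive):
    """Remove a fraction of nodes ranked by measure and return S afterwards."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}   # work on a copy
    n = len(adj)
    budget = int(fraction * n)
    if progressive:
        victims = None                                # re-rank after each removal
    else:
        scores = measure(adj)                         # rank once, up front
        victims = sorted(scores, key=scores.get, reverse=True)[:budget]
    for step in range(budget):
        if progressive:
            scores = measure(adj)
            node = max(scores, key=scores.get)
        else:
            node = victims[step]
        for nbr in adj.pop(node):
            adj[nbr].discard(node)
    return largest_component_fraction(adj, n)


# Two cliques (nodes 0-4 and 5-9) joined only through bridge node 10.
adj = {i: set(range(0, 5)) - {i} for i in range(0, 5)}
adj.update({i: set(range(5, 10)) - {i} for i in range(5, 10)})
adj[10] = {0, 5}
adj[0].add(10)
adj[5].add(10)

s_hub = attack(adj, degree, 0.1, progressive=True)
s_bridge = attack(adj, betweenness, 0.1, progressive=True)
```

In this example the bridge node has only two links, yet removing it splits the network in half, whereas removing the highest-degree node leaves a larger component intact.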

Fig. 6.3 Dark networks’ robustness against attacks. (a) Progressive attacks to the GSJ network.
(b) Progressive attacks to the Dark Web. Two types of attacks are used: hub attack (filled markers)
and bridge attack (empty markers)



Figure 6.3a-b presents the difference between the network reactions to bridge
attacks and hub attacks. The critical points, f, at which the network falls into many
small components, are marked on the diagram. The behaviors of the Meth World
and the gang network are very similar to that of the GSJ network. The results show that
these terrorist and criminal networks are more sensitive to attacks targeting the bridges
than to those targeting the hubs (f_b < f_h). In Fig. 6.3b, however, f_b and f_h are very
close, indicating that hub attacks and bridge attacks can be equally effective in
disrupting a one-regime scale-free network.

The results are quite consistent with findings from a prior study (Holme et al.
2002) that pure scale-free networks are vulnerable to both hub and bridge attacks, 
while small-world networks are more vulnerable to bridge attacks. In a small-world 
network, which consists of communities and groups, there might be many bridges 
linking different communities together. Intuitively, when these bridges are removed, 
the network will quickly fall apart. Note that a bridge may not necessarily be a hub 
since a node that connects two communities can have as few as two links. Small- 
world networks such as the dark networks thus are more vulnerable to bridge attacks 
than hub attacks. 

In these dark networks, bridges and hubs usually are not the same nodes. The 
rank order correlations between degree and betweenness in the GSJ, Meth World, 
and the gang network are 0.63, 0.47, and 0.30, respectively. Note that although 
bridge attacks are more devastating, strategies targeting the hubs are also fairly 
effective since these networks also have scale-free properties. Hub attacks and 
bridge attacks can be equally effective in tearing apart a pure scale-free network 
(e.g., the Dark Web with a high degree-betweenness rank order correlation, 0.70), in 
which hubs are also bridges connecting different parts of the network. 



5 Conclusions 

Dark networks such as terrorist networks and narcotics-trafficking networks are 
hidden from our view yet could have a devastating impact on our society and econ- 
omy. Understanding the topology of these dark networks can reveal greater insight 
into these clandestine organizations and help develop effective disruptive strategies. 
However, reliable data about these dark networks are extremely difficult to obtain, 
causing our understanding of covert networks to remain mostly hypothetical. To the 
best of our knowledge, the datasets used in this chapter, although they are subject to 
several limitations, are the first sets which allow for statistical analysis of the topol- 
ogies of dark networks. 

We found that these covert networks share many common topological properties 
with other types of networks. Their efficiency in communication and flow of informa- 
tion, commands, and goods can be tied to their small-world structures, characterized 
by small average path length and high clustering coefficient. In addition, we found 
that because of the small-world properties, dark networks are more vulnerable to 
attacks on the bridges that connect different communities than to attacks on the hubs. 
This may provide authorities with insight for intelligence and security purposes. 

An interesting finding about the three human networks is that their substantially 
high clustering coefficients, which are not always present in other empirical net- 
works, are difficult to regenerate based only on the known network effects such as 
preferential attachment and small-world effects. Other mechanisms such as recruitment
may have played an important role in the evolution of these networks. Some
research has found that alternative mechanisms such as Highly Optimized Tolerance 
(HOT) may govern the evolution of many complex systems in environments with 
high risks and uncertainty (Carlson and Doyle 1999). In our future research, we will
study the impacts of such alternative mechanisms on the topology of networks. In 
addition, the findings presented in this chapter are all based on the static views of 
the networks. That is, we do not consider a large variety of dynamics that might 
have taken place in the evolution of these networks. We plan to study network
evolution in future research.

We want to point out that people should be careful when interpreting these find- 
ings. Because dark networks are covert networks and the underlying actual net- 
works are largely unknown, there may be hidden links which are missing in the 
elicited networks. These hidden links may play indispensable roles in maintaining 
the function of the covert organizations. As a result, we must be extremely cautious 
when any decision is to be made to disrupt these networks. 

Chapter 7

Interactional Coherence Analysis 



1 Introduction 

Computer-mediated communication (CMC) is any form of communication between 
two or more individuals who interact and/or influence each other via computer- 
supported media. Text-based modes of CMC include e-mail, listservs, forums, chat 
rooms, instant messaging, and the World Wide Web (Herring 2002). There is no 
doubt that the popularity of CMC is continuing to grow. E-mail, Web forums, news- 
groups, and chat rooms have already become essential parts of our daily lives, pro- 
viding a communication medium for various activities (Meho 2006; Radford 2006). 
Although the ubiquitous nature of CMC provides a convenient mechanism for com- 
munication, it is not without its shortcomings. The fragmented, ungrammatical, and 
interactionally disjointed nature of CMC discourse, attributable to the limitations of 
the CMC media, has rendered CMC highly incoherent (Hale 1996). 

Beaugrande and Dressler (1996) defined coherence in linguistics as a “continuity
of senses” and “the mutual access and relevance within a configuration of concepts 
and relations.” For Web discourse, coherence defines the macrolevel semantic struc- 
ture (Barzilay and Elhadad 1997). Barzilay and Elhadad (1997) further pointed out 
that “coherence is represented in terms of coherence relations between text seg- 
ments, such as elaboration, cause and explanation.” Coherence of online discourse, 
correspondingly, is represented in terms of the “reply-to” relations between CMC 
messages. The “reply-to” relationships can serve several functions, such as elabo- 
rating or complementing previous postings, greeting fellow users, answering ques- 
tions, or oppugning previous messages. 

Computer-mediated interaction (CMI) refers to the social interaction between 
CMC users (Walther et al. 1994). Such social interaction is built through the “reply- 
to” relationships between messages. Therefore, we also refer to the “reply-to” rela- 
tionship as the interaction relationship between messages. A social interaction in 
online discourse happens if a user posts a message that has a “reply-to” relation with 
other users’ messages. Occasionally, a user may interact with other users without 
specifying the messages he or she responds to. Common greeting messages like 
“Hi Jatt” are examples. But we can build fake “reply-to” relationships between such
messages with the addressed user’s nearest message. This method does not affect 
the social interaction relationships between the users. 

Since the “reply-to” relations between CMC messages can be used to build the 
social interaction between users, coherence of CMC is also called CMC interac- 
tional coherence in previous studies (e.g., Herring 1999). However, current CMC 
media suffer from the “disrupted turn adjacency” problem, and the existing system
functionalities do not contain sufficient “reply-to” information. In light of the incoherent
and fragmented nature of text-based Web discourse, many researchers have pointed 
out the importance of automatically identifying CMC interactional coherence. 
Te’eni (2001) claimed that interactional coherence information is particularly 
important “when there are several participants” and “when there are several streams 
of conversation and each stream must be associated with its particular feedback.” 
Users of CMC systems cannot safely assume that they will receive a response to 
their previous message because of the lack of interactional coherence (Herring 
1999). Accurate interaction information is also important to researchers for a pleth- 
ora of reasons. User interaction in text-based CMC represents one of the fundamen- 
tal building block metrics for analyzing cyber communities. Interaction-related 
attributes help identify CMC user roles and user’s social and informational value, as 
well as the social network structure of online communities (Smith and Fiore 2001; 
Fiore et al. 2002; Barcellini et al. 2005). Moreover, interactional coherence is invalu- 
able for understanding knowledge flow in electronic communities and networks of 
practice (Osterlund and Carlile 2005; Wasko and Faraj 2005). 

Interactional coherence analysis (ICA) attempts to accurately identify the “reply- 
to” relationships between CMC messages so that we can reconstruct CMC interac- 
tional coherence and present the social interaction between CMC users. Previously 
used ICA features include system-generated attributes such as quotations and mes- 
sage headers, as well as linguistic features such as repetition of keywords across 
postings (Sack 2001; Spiegel 2001; Yee 2002). Although considerable efforts have 
been devoted to improving interaction representations using ICA, previous studies 
suffer from several limitations. Most used a couple of specific features, whereas 
effective capture of interaction cues entails the use of a larger set of system and lin- 
guistic attributes (Nash 2005). Furthermore, the techniques incorporated often 
ignored noise issues such as typos, misspellings, nicknames, etc., which are prevalent 
in CMC (Nasukawa and Nagano 2001). In addition, there has been little emphasis on 
Web forums, a major form of asynchronous online discourse. Previous work has 
focused on e-mail-based newsgroups and chat rooms. Web forums differ from e-mail 
and synchronous forms of electronic communication in terms of the types of salient 
coherence cues, user behavior, and communication dynamics (Hayne et al. 2003). 

In this study, we propose the Hybrid Interactional Coherence (HIC) algorithm for 
Web forum interactional coherence analysis. HIC attempts to address the limitations 
of previous studies by utilizing a holistic feature set which is composed of both 
linguistic coherence attributes and CMC system features. The remainder of this 
chapter is organized as follows: Sect. 2 presents a review of previous ICA research. 
Section 3 highlights important research gaps and questions. Section 4 presents a 
system design geared toward addressing the research questions, including the use of 
the HIC algorithm with an extended set of system and linguistic features. It also 
provides details of the various components of our HIC algorithm. Experimental 
results based on evaluations of the HIC algorithm in comparison with previous tech- 
niques are described in Sect. 5. Section 6 includes conclusions and future directions. 



2 Related Work 

CMC interactional coherence is crucial for both researchers and CMC users. 
Interaction information can be used to identify user roles and messages’ values, as 
well as the social network pertaining to an online discussion. Example applications 
that can benefit from accurate online discourse interaction information include ana- 
lyzing the effectiveness of e-mail-based interviewing (Meho 2006) and chat-based 
virtual reference services (Radford 2006). Interactional coherence analysis provides 
users and researchers a better understanding of specific online discourse patterns. 
Unfortunately, deriving interaction information from online discourse can be prob- 
lematic, as discussed below. 



2.1 Obstacles to CMC Interactional Coherence 

Two properties of the CMC medium are often cited as obstacles to CMC interac- 
tional coherence (Herring 1999): lack of simultaneous feedback and disrupted turn 
adjacency. Most CMC media are text-based so they lack audio or visual cues preva- 
lent in other communication mediums. Furthermore, text-based messages are sent 
in their entirety without any overlap. These two characteristics result in a lack of 
simultaneous feedback. However, advanced CMC media have already provided 
simple solutions to address this concern. For example, newer versions of instant 
messaging software include audio and video capabilities in addition to the standard 
text functionality. These tools also show whether a user is typing a response, thereby 
providing response cues allowing interaction in a manner more similar to face-to- 
face communication. Since those solutions perform quite well, lack of simultaneous 
feedback is no longer a severe problem for CMC interactional coherence. 

In contrast, resolving the disrupted turn adjacency problem remains an arduous 
yet vital endeavor. Disrupted turn adjacency refers to the fact that messages in CMC 
are often not adjacent to the postings they are responding to. Disrupted adjacency 
stems from the fact that CMC is “turn-based.” As a result, the conversational struc- 
ture is fragmented, that is, a message may be separated both in time and place from 
the message it responds to (Herring 1999). Both synchronous (e.g., chat rooms, 
instant messaging) and asynchronous (e.g., e-mail, forums) forms of CMC suffer 
from disrupted turn adjacency. Several previous studies have observed and analyzed 
this phenomenon. Herring and Nix (1997) found that nearly half (47%) of all turns 
were “off topic” in relation to the previous turn. Recently, Nash (2005) manually 
analyzed data from an online chat room and found that the gap between a message 
and its response can be as many as 100 turns. 

User       Feature                            Message
Ashna      Direct Address                     Hi jatt
Dave-G     Direct Address                     Kally I was only joking around
Jatt       Direct Address                     Ashna: hello?
Kally      Substitution                       I don't think so.
Ashna      Direct Address & Co-reference      How are u jatt
LUCKMAN    N/A                                SSa all
Dave-G     Co-reference & Conjunction         Therefore we need to talk
Jatt       Lexical relation & Co-reference    Do we know each other? I’m ok how are you

Fig. 7.1 Example of disrupted adjacency 

Figure 7.1 shows an example of disrupted adjacency taken from Paolillo (2006). 
The disruption is obvious in the example and is attributable to the fact that two dis- 
cussions are intertwined in a single thread. The lines to the right-hand side indicate 
the interaction relations among postings: two different widths are used to differenti- 
ate the parallel discussions. There is also one message that is not related to any of 
the other messages, posted by the user “LUCKMAN.” The middle column lists the 
linguistic features used in these messages, which will be introduced in Sect. 2.2.2. 

The objective of ICA is to develop techniques to construct the interaction rela- 
tions such as those shown in the right-hand side of the example. Such message 
interaction relations can be further used to construct the social network structure of 
CMC users, leading to a better understanding of CMC and its users and providing 
necessary information for improving ICA accuracy. A review of previous interac- 
tional coherence analysis research is presented in the following section. 
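
As a minimal sketch of how message-level “reply-to” relations can be rolled up into a user-level network, the following Python fragment counts directed reply edges between authors. The message records, field names (`id`, `author`, `reply_to`), and the toy thread are illustrative assumptions and are not taken from the original study.

```python
from collections import Counter

def build_interaction_network(messages):
    """Aggregate message-level reply-to links into weighted user-to-user edges."""
    author_of = {m["id"]: m["author"] for m in messages}
    edges = Counter()
    for m in messages:
        parent = m.get("reply_to")
        if parent is None or parent not in author_of:
            continue  # thread starter, isolated message, or unresolved link
        # Directed edge: replying author -> author of the parent message
        edges[(m["author"], author_of[parent])] += 1
    return edges

# Toy thread (hypothetical reply_to values, loosely inspired by Fig. 7.1)
thread = [
    {"id": 1, "author": "Ashna", "reply_to": None},
    {"id": 2, "author": "Jatt", "reply_to": 1},
    {"id": 3, "author": "Ashna", "reply_to": 2},
    {"id": 4, "author": "LUCKMAN", "reply_to": None},
    {"id": 5, "author": "Jatt", "reply_to": 3},
]
network = build_interaction_network(thread)
```

In this toy thread the isolated message by “LUCKMAN” contributes no edge, which mirrors the unrelated posting discussed above.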



2.2 CMC Interactional Coherence Analysis 

Common interactional coherence research characteristics include domains, features, 
noise issues, and techniques. Table 7.1 presents a taxonomy of these vital CMC 
interactional coherence analysis characteristics. Table 7.2 shows previous CMC 
interactional coherence studies based on the proposed taxonomy. Header information 
and quotations (F1 and F2) are system features, whereas features 3-6 (F3-F6) 
are linguistic features. A dashed line is used to distinguish these feature categories. 
The taxonomy and related studies are discussed in detail below. 

Table 7.1 A taxonomy of CMC interactional coherence research 

Category                                      Description                                           Label
Domain
  Synchronous CMC                             Internet Relay Chat (IRC), MUD, IM, etc.              D1
  SMTP-based asynchronous CMC                 E-mail, newsgroups                                    D2
  HTTP-based asynchronous CMC                 Web forums/BBS, Web blogs                             D3
  Text document                               News, articles, text files, etc.                      D4
Feature
  Header information                          “Reply-to” information in header or title             F1
  Quotation                                   Copy previous related message in response             F2
  Co-reference                                Personal, demonstrative, comparative co-reference     F3
  Lexical relation                            Repetition, synonymy, superordinate                   F4
  Direct address                              Mention username of respondent                        F5
  Other linguistic features                   Substitution, ellipsis, conjunction                   F6
Noise                                         Typo, misspellings, nicknames, modified quotations
Technique
  Manual                                      Manually identify the interaction                     T1
  Link-based method                           Link messages by using CMC system features only       T2
  Similarity-based method                     Word match, VSM, SVM, lexical chain                   T3


Table 7.2 Selected previous CMC interactional coherence studies 

                                           Features
Previous studies            Domains        F1   F2   F3   F4   F5   F6   Noise   Techniques
Xiong et al. (1998)         SMTP-based     ✓                             No      Link-based
Bagga and Baldwin (1998)    Text                     ✓                   No      Similarity-based
Choi (2000)                 Text                          ✓              No      Similarity-based
Smith and Fiore (2001)      SMTP-based     ✓                             No      Link-based
Sack (2001)                 SMTP-based     ✓    ✓                        No      Link-based
Spiegel (2001)              Synchronous                   ✓              No      Similarity-based
Soon et al. (2001)          Text                     ✓                   No      Similarity-based
Newman (2002)               SMTP-based     ✓                             Yes     Link-based
Yee (2002)                  SMTP-based     ✓                             No      Link-based
Barcellini et al. (2005)    SMTP-based                                   -       Manual
Nash (2005)                 Synchronous              ✓    ✓    ✓    ✓    -       Manual



2.2.1 CMC Interactional Coherence Domains 

CMC interactional coherence research has been conducted on both synchronous 
and asynchronous CMC since both of these modes show a high degree of disrupted 
turn adjacency (Herring 1999). Synchronous CMC, which includes all forms of 
persistent conversation, suffers from multiple, intertwined topics of conversation 
(Khan et al. 2002). In comparison, asynchronous CMC has a “thread” function, 
which is an effective method for categorizing forum postings based on a specific 
topic. However, the “thread” function is not perfect. First, it does not show message-level 
interactions, which are vital for constructing the social network structure of CMC 
users. Instead, it is just an effort to group related messages together. Second, even in 
a single thread, subtopics might be generated during the discussion. This phenomenon, 
which poses severe problems for Web forum information retrieval and content 
analysis, is called “topic decay/drift” (Herring 1999; Smith and Fiore 2001). 
Therefore, it is still necessary and important to apply interactional coherence analysis 
to asynchronous CMC. 

Explicit                                                                          Implicit
Header         Quotation    Direct     Lexical     Co-Reference    Conjunction
Information                 Address    Relation                    Substitution
                                                                   Ellipsis

Fig. 7.2 Features’ relative explicit/implicit properties 

Asynchronous CMC modes can be classified into two categories: SMTP-based and 
HTTP-based. SMTP-based modes (e.g., Usenet) use e-mail to post messages to forums, 
whereas HTTP-based methods use forms embedded in the Web pages. Previous 
research often focused on SMTP-based modes because the headers of posted messages 
contain what is referred to as “reply-to information” that specifically mentions the ID 
of the message being responded to. Loom (Donath et al. 1999), Conversation Map 
(Sack 2001), and Netscan (Smith and Fiore 2001) are all well-known tools that have 
been developed to show interaction networks of Usenet newsgroups (SMTP-based). In 
contrast, HTTP-based modes such as Web forums and blogs do not contain such useful 
header information for constructing interaction networks. Consequently, there has 
been little work on HTTP-based CMC as illustrated by Table 7.2. 

We also incorporate text documents into our taxonomy because they experience 
some problems similar to CMC incoherence, such as co-reference resolution (Bagga 
and Baldwin 1998; Soon et al. 2001) and text segmentation (Choi 2000). Techniques 
used for text document co-resolution, such as sliding windows (Hearst 1994), lexi- 
cal chains (Morris 1988), and entity repetition (Khan et al. 1998), are applicable to 
all forms of text and can provide utility for CMC interactional coherence research. 

2.2.2 CMC Interactional Coherence Research Features 

Two categories of features have been used by previous CMC researchers and system 
developers. The first category is system features, which are functionalities provided 
by the CMC systems. The second one is linguistic features, which are interpersonal 
language cues. 

Nash (2005) defined explicit features as those that “make fewer assumptions 
about what information is activated for the recipients.” Figure 7.2 shows features’ 
relative explicit/implicit properties. Features on the left side are more explicit than 
those on the right side. Explicit features are generally easier to use for deriving 
interaction patterns. In contrast, implicit features such as conjunctions and ellipsis 
are far more difficult to accurately incorporate for interactional coherence analysis. 
The various features are described in detail below. 


CMC System Features 

CMC system features are usually only provided by asynchronous CMC systems. 
Header information and quotations are two kinds of CMC system features that can 
be used to construct interaction networks of asynchronous online discourse. Lewis 
and Knowles (1997) pointed out that SMTP-based asynchronous CMC systems will 
“automatically insert into a reply message two kinds of header information: unique 
message IDs of parent messages and a subject line of the parent (copied to the reply 
message’s subject line).” Unique message IDs of the parent message are intuitively 
useful for interaction identification. In contrast, subject lines of messages are less 
useful because different conversations in the same thread may have similar subject 
lines. Unfortunately, for HTTP-based modes, only the second type of header infor- 
mation is available. As shown in Table 7.2, most previous studies for SMTP-based 
asynchronous CMC systems relied on header information (F1 column) to construct 
interaction networks (e.g., Sack 2001; Barcellini et al. 2005). 

Quotations (F2 column), a context-preserving mechanism used in online discus- 
sions (Eklundh 1998), are less frequently used to represent online conversations. 
Conversation Map (Sack 2001) and Zest (Yee 2002) are among the few previous 
studies that used automatic quotation identification to address disrupted adjacency. 
Barcellini et al. (2005) manually analyzed quotations and used them to identify par- 
ticipants’ conversation roles. 

Although header information and quotations are effective for identifying interac- 
tion and should result in high precision intuitively, in reality, they suffer several 
drawbacks. From the systems’ point of view, only asynchronous CMC systems con- 
tain such features. Moreover, header information provided by HTTP-based asyn- 
chronous CMC systems is of little value in many cases where the subject lines of all 
subsequent messages are similar or even identical. Furthermore, from the users’ 
point of view, some participants do not use system features, and others may not use 
system functions correctly (Lewis and Knowles 1997; Eklundh and Rodriguez 
2004). For instance, interaction cues may appear in the message body. Finally, some 
messages can interact with multiple previous messages, and system features may 
not be able to capture such multiple interactions. As a result, using system features 
alone fails to consider such idiosyncratic user behavior, resulting in an incomplete 
representation of CMC interaction. 

As is shown in Table 7.2, previous research on SMTP-based asynchronous CMC 
relied mostly on system features to construct the interaction network. CMC systems 
incorporating system and linguistic features for identification of interaction pat- 
terns, such as the Conversation Map system proposed by Sack (2001), are a rarity. 
The Conversation Map system also constructs interaction networks primarily using 
system features, but then uses the message content to construct semantic networks, 
which display the discussion themes for interacting messages (Sack 2001). 

The content of messages, which can be represented by various linguistic fea- 
tures, may be useful to complement system features in constructing CMC interac- 
tions and in many cases may be even more important (Nash 2005). Therefore, our 
approach utilizes both CMC system and linguistic features to construct the interaction 
network with the intention of creating a more accurate representation of 
CMC interactional coherence and its social network structure. Important linguistic 
features are discussed in the following section. 



Linguistic Features 

Linguistic features are interpersonal language cues and content-based features. 
Previous research on synchronous CMC systems had to rely on linguistic features 
to construct interaction networks since no system features were available. Several 
linguistic features for online communication have been identified by previous 
research. Three prevalent features are direct address, lexical relations, and co-reference 
(Halliday and Hasan 1976; Herring 1999; Spiegel 2001; Nash 2005). 

Direct address takes place when a user mentions the username of another user 
whom he or she is addressing in the message. Coterie (Spiegel 2001), a visualiza- 
tion tool for conversation within Internet Relay Chat, looks for direct addresses of 
specific people to construct the interaction network. It is important to note that 
addressing someone is different from referencing someone. Take the following sen- 
tence as an example: “John, take care of your brother Tom.” The speaker is address- 
ing (and interacting) with “John” only, although “Tom” is also referenced. 

Lexical relations occur when a lexical item refers to another lexical item by hav- 
ing common meanings or word stems. Its most common forms are repetition and 
synonymy (Nash 2005). Lexical relations have also been widely used in previous 
studies of synchronous CMC systems. For example, Choi (2000) used repetition of 
keywords to identify relationships between messages. Techniques that compare text 
similarities are often used for identifying lexical relations, where two messages are 
considered to have an interaction if their similarity is above some predefined thresh- 
old (Bagga and Baldwin 1998). 

Co-reference also occurs when a lexical item refers to another one; however, 
such a relationship can only be identified by the context instead of the word mean- 
ings or stems. Personal co-reference is most commonly used in CMC. For example, 
the word “you” is frequently used to refer to the person a message addresses. Other 
co-references include demonstrative co-reference, which is made on the basis of 
proximity, and comparative co-reference, which uses words such as “same,” “simi- 
lar,” and “different” (Nash 2005). 

Some other linguistic features identified by previous studies include: conjunc- 
tions (e.g., but, however, therefore), substitution (e.g., “I think so.”), ellipsis (e.g., 
“Guess that would not be easy.”), etc. (Nash 2005). These features have rarely been 
incorporated in previous studies due to the difficulty in identifying such features and 
their lack of prevalence in online discourse. Figure 7.1 shows an example that 
includes most linguistic features mentioned here. 

Looking back to Table 7.2, we can see that most previous studies only utilized 
one or two specific features. Only Nash (2005) manually identified multiple lin- 
guistic features for an online chat room and found three of them to be dominant. 
Lexical relations covered 51% of the interaction pattern, whereas direct address and 
co-reference covered 28% and 15%, respectively. 

2.2.3 CMC Interactional Coherence Analysis Techniques 

In light of the fact that several types of features can be used for interactional coher- 
ence analysis, many different techniques have previously been used to construct 
interaction patterns. These can be classified into three major categories: manual 
analysis, link-based techniques, and similarity-based techniques. 

Eklundh and Rodriguez (2004) manually identified lexical relations, direct 
address, and co-reference for one specific online discussion. Similarly, Nash (2005) 
identified and extracted six linguistic features for an English chat room. Barcellini 
et al. (2005) manually analyzed quotations and used them to identify participants’ 
conversation roles. Manual analysis of CMC interactional coherence has the obvi- 
ous advantage of accuracy. However, its disadvantage is also obvious: it is difficult 
to apply to large datasets and is labor intensive. 

Link-based techniques construct interaction patterns using system features or 
rules based on message sequences. These techniques are highly prevalent in previ- 
ous research because of their representational simplicity as compared to techniques 
that focus on linguistic features. Direct linkage techniques link messages based on 
header information and quotations. For residual messages unidentified by direct 
linkage, naive linkage (Commer and Peterson 1986) has been used. Naive linkage 
is a rule-based technique which proposes that a message is related to all prior mes- 
sages in the same discussion or the first message in the same discussion. The advan- 
tage of link-based techniques is that they are easy to implement. However, link-based 
techniques depend on the assumption that CMC users utilize system features cor- 
rectly. Moreover, naive linkage is of low accuracy and often overgeneralizes partici- 
pation patterns due to its simplistic rule-based properties. 
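
The naive linkage rule described above can be sketched in a few lines of Python. The function name, the `mode` parameter, and the message IDs are hypothetical; the sketch only illustrates the two rule variants (link to the first message, or link to all prior messages) and is not the implementation used in the cited work.

```python
def naive_linkage(thread_ids, residual_ids, mode="first"):
    """Link each residual message to the first message or to all prior messages.

    thread_ids   -- message IDs of one discussion, in posting order
    residual_ids -- IDs left unlinked by direct linkage
    mode         -- "first" or "all" (hypothetical switch between the two rules)
    """
    links = {}
    for position, message_id in enumerate(thread_ids):
        if position == 0 or message_id not in residual_ids:
            continue  # the opening message has nothing to reply to
        if mode == "first":
            links[message_id] = [thread_ids[0]]
        else:
            links[message_id] = list(thread_ids[:position])
    return links

first_links = naive_linkage([10, 11, 12, 13], {12, 13}, mode="first")
all_links = naive_linkage([10, 11, 12, 13], {12, 13}, mode="all")
```

The "all" variant makes the overgeneralization problem visible: every residual message is connected to every earlier posting regardless of content.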

Similarity-based techniques typically use content similarity to construct interac- 
tion patterns. These techniques focus on uncovering interaction cues found in the 
message texts to provide insight into interactional coherence. The simplest method is 
exact match or direct match, which tries to identify repetition of words, word phrases, 
or even sentences (Choi 2000; Spiegel 2001). More advanced similarity-based techniques 
include the vector space model, which has been used for cross-document co-reference 
resolution (Bagga and Baldwin 1998) as well as to identify quoted messages 
(Lewis and Knowles 1997), and lexical chains, which are often created using WordNet 
for text summarization and interaction identification (Barzilay and Elhadad 1997; 
Sack 2001). Similarity-based techniques are effective for identifying certain linguis- 
tic features (e.g., lexical relations and direct address). Some have been successfully 
applied in research related to text documents. However, similarity-based techniques 
are susceptible to noise and require careful selection of parameters. 
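
To make the threshold-based use of the vector space model concrete, the following is a minimal sketch using raw term frequencies and cosine similarity. The stopword list, tokenizer, and the threshold value of 0.2 are illustrative assumptions; this is a generic vector space example and not the Lexical Match Algorithm proposed later in the chapter.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not from the source)
STOPWORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "it", "i", "we", "do"}

def term_vector(text):
    """Lowercase, tokenize, and count non-stopword terms."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)

def cosine(u, v):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def most_similar_prior(message, prior_messages, threshold=0.2):
    """Return the index of the most similar earlier message, or None if none exceeds the threshold."""
    target = term_vector(message)
    best_index, best_score = None, threshold
    for index, prior in enumerate(prior_messages):
        score = cosine(target, term_vector(prior))
        if score > best_score:
            best_index, best_score = index, score
    return best_index

priors = ["The new forum software is slow", "Anyone watching the football match tonight?"]
linked = most_similar_prior("Football tonight should be a great match", priors)
```

The choice of threshold illustrates the parameter sensitivity noted above: set too low, unrelated messages are linked; set too high, genuine replies are missed.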



3 Research Gaps and Questions 

Based on our review of previous literature, we have identified several important 
research gaps. First, little interactional coherence analysis has been conducted for 
HTTP-based asynchronous CMC. Previous research focused on Usenet newsgroups 
and e-mail, the headers of which contain accurate interaction information, rendering 
the use of system features sufficient for accurately capturing a large proportion of 
the interaction patterns. However, many Web site and ISP forums (e.g., Yahoo!, 
MSN) do not use the e-mail protocol. Relying only on system features for such 
CMC modes can result in a significant amount of neglected interaction information. 
Second, little previous research has implemented techniques that use both CMC 
system features and linguistic attributes for interactional coherence analysis. The 
use of a more holistic feature set composed of features occurring in message headers 
and bodies could greatly improve interaction recall. Finally, previous research has 
placed little emphasis on the impact of noise in CMC interaction networks. 

Based on the research gaps identified, we propose the following research 
questions: 

1. How effectively can we analyze interactional coherence for HTTP-based Web 
forums using automated techniques? 

2. How can techniques that use both CMC system and linguistic features improve 
interaction representational accuracy as compared to methods that only utilize a 
single feature category? 

3. What impact do forum dynamics (i.e., user system usage behavior) exert on 
interaction representational accuracy? 



4 System Design: Hybrid Interactional Coherence System 

In order to address these research questions, we developed the Hybrid Interactional 
Coherence (HIC) algorithm to perform more accurate interactional coherence anal- 
ysis, that is, to identify the “reply-to” relationships between CMC messages. The 
algorithm has three major components: system feature match, linguistic feature 
match, and residual match. System feature match and the direct address part of the 
linguistic feature matching component are used to identify interactions stemming 
from relatively more explicit features (such as headers, quotations, and direct 
addresses). The lexical relation match and rule-based module (which derive interac- 
tion patterns from relatively implicit cues) are only utilized when more explicit 
features are not present in a posting. 
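
The ordering of components, with explicit cues tried first and implicit cues used only as a fallback, can be sketched as a simple cascade. Everything below is a hypothetical illustration: the matcher functions are crude stand-ins (the lexical relation stage is omitted for brevity), returning only the first hit is a simplification, and none of the names come from the original system.

```python
def hic_cascade(message, prior_messages, matchers):
    """Try matchers from most to least explicit; return (stage, parent_id) of the first hit."""
    for stage, matcher in matchers:
        parent_id = matcher(message, prior_messages)
        if parent_id is not None:
            return stage, parent_id
    return None, None

def quotation_stub(message, prior_messages):
    """Stand-in for system feature match: exact quotation lookup."""
    quote = message.get("quote")
    for prior in reversed(prior_messages):
        if quote and quote in prior["body"]:
            return prior["id"]
    return None

def direct_address_stub(message, prior_messages):
    """Stand-in for direct address match: exact screen-name lookup, most recent posting first."""
    words = {w.strip(",.:!?").lower() for w in message["body"].split()}
    for prior in reversed(prior_messages):
        if prior["author"].lower() in words:
            return prior["id"]
    return None

def residual_stub(message, prior_messages):
    """Stand-in for residual match: fall back to the first message of the thread."""
    return prior_messages[0]["id"] if prior_messages else None

MATCHERS = [
    ("system feature", quotation_stub),
    ("direct address", direct_address_stub),
    ("residual", residual_stub),
]

prior = [
    {"id": 1, "author": "Ashna", "body": "Hi all, anyone here?"},
    {"id": 2, "author": "Jatt", "body": "Do we know each other?"},
]
by_address = hic_cascade({"body": "Jatt: not yet", "quote": None}, prior, MATCHERS)
by_quote = hic_cascade({"body": "sure", "quote": "anyone here"}, prior, MATCHERS)
by_residual = hic_cascade({"body": "ok", "quote": None}, prior, MATCHERS)
```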

Several major types of noise have also been addressed. 

System features used in our implementation include both headers and quota- 
tions. With header information, unique IDs of parent messages are checked first. 
Message subject lines are also analyzed and used. With quotations, our algorithm 
can identify not only normal quotations but also two special types of quotation, that 
is, multiple quotations and nested quotation (Barcellini et al. 2005). The algorithm 
overcomes quotation noise by using a sliding window method, which compares part 
of the quotation to previous messages. The sliding window method has been successfully 
used in text similarity detection and authorship discrimination (Nahnsen 
et al. 2005; Abbasi and Chen 2006). Compared with the sentence-level matching 
approach adopted by Newman (2002), the sliding window is better at dealing with 
quotation modifications made by systems or users because it is a character-level 
method (i.e., it compares substrings). 

Fig. 7.3 HIC system design 

With respect to linguistic features, our algorithm mainly uses direct address and 
lexical relations. For direct address, besides traditional simple name match, our 
algorithm uses Dice’s equation to overcome noise such as typos, misspellings, and 
nicknames. Dice’s equation uses character-level n-gram matching to identify seman- 
tically related pairs of words (Adamson and Boreham 1974). We also differentiate 
common words and usernames by using a lexical database and automatically gener- 
ated part-of-speech (POS) tags. For lexical relations, a Lexical Match Algorithm 
(LMA), developed based on the vector space model frequently used in information 
retrieval (Salton and McGill 1986), is adopted. 

A comprehensive residual matching mechanism is developed for the remaining 
messages. It improves the naive linkage method (Commer and Peterson 1986) by 
matching messages based on their context and co-reference features. Figure 7.3 
shows our system design. Details of each component are presented below. 



4.1 Data Preparation 

The data preparation component is designed to extract messages and their associated 
metadata from Web forums. All relevant header information is extracted first. Then 
each message’s quotation part and body text are separated using a parser program. 
The parser program was also designed to deal with two special types of quotation, 
nested quotation and multiple quotations. Nested quotation happens when a message 
which already contains quotations is quoted. The parser program only parses the 
quotation that is nearest to the message. Sometimes, users respond to different quo- 
tations in one message, which is referred to as “multiple quotation.” The parser 
program parses all the related quotations. The final step of data preparation is to 
extract other relevant information from each message, such as author screen names, 
date stamps, message subjects, etc. 
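
The handling of nested and multiple quotations can be illustrated with a small parser. The sketch below assumes, purely for illustration, that quotations are delimited by BBCode-style `[quote]` and `[/quote]` tags; real forums use a variety of markup, and the original parser's input format is not described here. Text at nesting depth 1 is kept as the quotation nearest to the message, deeper text is skipped, and every top-level quotation is returned.

```python
import re

# Hypothetical quotation markup; actual forum software differs.
QUOTE_TOKEN = re.compile(r"(\[quote\]|\[/quote\])", re.IGNORECASE)

def split_quotes(raw):
    """Separate a raw posting into its top-level quotations and its own body text."""
    quotes, body, current = [], [], []
    depth = 0
    for token in QUOTE_TOKEN.split(raw):
        lowered = token.lower()
        if lowered == "[quote]":
            depth += 1
        elif lowered == "[/quote]":
            if depth == 1:
                # Closing a top-level quotation: store it (multiple quotations allowed)
                quotes.append("".join(current).strip())
                current = []
            depth = max(depth - 1, 0)
        elif depth == 0:
            body.append(token)
        elif depth == 1:
            current.append(token)
        # depth >= 2: text of an older, nested quotation is skipped
    return quotes, "".join(body).strip()

nested = split_quotes("[quote][quote]old[/quote]newer[/quote]my reply")
multiple = split_quotes("[quote]first[/quote]answer 1 [quote]second[/quote]answer 2")
```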



4.2 HIC Algorithm: System Feature Match 

4.2.1 Header Information Match 

In header information match, unique message IDs of parent messages, if available, 
are used to identify interaction. Subject lines of messages in the same thread are 
often consistently similar with each other if they are automatically generated by 
CMC systems. However, if CMC users intentionally embed interaction cues within 
them, subject lines can be used to identify interaction patterns as well. 



4.2.2 Quotation Match 

In quotation match, the quotation part of each message is compared with the body 
text of previous messages. As previously mentioned, CMC systems may modify the 
format of quotations (Newman 2002), whereas CMC users may modify quotations 
to save space (Eklundh 1998). Therefore, in our system, the quotation part of each 
message is first searched for in the body text of all previous messages, referred to as 
“simple match.” If simple match fails due to the various aforementioned forms of 
noise, a sliding window method is triggered. 

A sliding window method breaks up a text into overlapping windows (substrings) 
and compares each window against previous body texts (Kjell et al. 1994; Nahnsen 
et al. 2005; Abbasi and Chen 2006). The system assigns the message (i.e., creates 
an interaction link) to the quoted message with the highest number of matching 
windows. The following example shows how a sliding window method with a win- 
dow size of ten characters and a jump interval of two characters can be used to 
identify modified quotations (Fig. 7.4). 



Original Message         Quoted Content         Message Text Windows    Quoted Text Windows
“What do you prefer?”    “...do you prefer?”    “What do yo”            “...do you ”
                                                “at do you ”            “.do you pr”
                                                “ do you pr”            “o you pref”
                                                “o you pref”            “you prefer”
                                                “you prefer”



Fig. 7.4 Example of sliding window method breaking a text into overlapping windows (substrings) 
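
The windowing in Fig. 7.4 can be reproduced with a short Python sketch. The function names are hypothetical, and comparing each quotation window as a substring of an earlier body text is one plausible reading of “compares each window against previous body texts”; it is a sketch under that assumption rather than the original implementation.

```python
def windows(text, size=10, step=2):
    """Break text into overlapping character windows of the given size and jump interval."""
    return [text[i:i + size] for i in range(0, len(text) - size + 1, step)]

def count_matching_windows(quotation, body, size=10, step=2):
    """Count quotation windows that occur verbatim in an earlier message body."""
    return sum(1 for w in windows(quotation, size, step) if w in body)

def match_quotation(quotation, prior_bodies, size=10, step=2):
    """Return the index of the prior body with the most matching windows, or None."""
    scores = [count_matching_windows(quotation, body, size, step) for body in prior_bodies]
    if not scores or max(scores) == 0:
        return None
    return scores.index(max(scores))

message_windows = windows("What do you prefer?")
quoted_windows = windows("...do you prefer?")
best = match_quotation("...do you prefer?", ["I like tea.", "What do you prefer?"])
```

For the example in Fig. 7.4, two of the four quotation windows (“o you pref” and “you prefer”) are found in the original message, so the modified quotation is still linked to it even though a simple match fails.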


4.3 HIC Algorithm: Linguistic Feature Match 

Linguistic features are used to complement system features in constructing CMC 
interaction patterns. Nash (2005) found that direct address, lexical relations, and 
co-reference were three dominant linguistic features. Therefore, our Hybrid 
Interactional Coherence algorithm mainly uses direct address and lexical relations 
in linguistic feature match, whereas the co-reference feature is indirectly used in 
residual match. 



4.3.1 Direct Address Match 

In direct address match, each word of a message is compared to the screen names of
the authors of previous messages. By only considering authors that have already posted
within the same thread, we reduce the possibility of incorrectly treating username
references as direct addresses. For the previous example, “John, take care of your
brother Tom,” if user “Tom” has not already appeared in the thread, no interaction
between the current message’s user and Tom will be assigned. When a direct
address-based interaction is found, the message containing the interaction cue is
assumed to have a “reply-to” relation with the addressed user’s most recent posting.
Initially, a simple match is performed to detect messages containing an exact author
screen name. If no simple match is found, a Dice-based character-level n-gram matching
technique is used to compensate for the direct address noise that is prevalent in CMC,
such as typos, misspellings, and nicknames. The technique first uses the following
Dice equation, which has been successfully used in identifying semantically related
pairs of words (Adamson and Boreham 1974; De Roeck and Al-Fares 2000), to estimate the
similarity between a word and an author’s screen name:

\[
\mathrm{Dice} = \frac{2 \times (\text{number of shared unique } n\text{-grams})}{\text{total unique } n\text{-grams}}
\]

A preestablished experiment-based threshold is applied to improve the accuracy of
direct address match. However, since many CMC users choose common English words as
their screen names, word-sense disambiguation methods need to be applied to
differentiate the common usage of a word from its use as a screen name. Our HIC
algorithm makes use of WordNet (Miller 1990), which has already been widely used in
word-sense identification (Voorhees 1993; Resnik 1995), to identify the meaning of
words, and a POS tagger (McDonald et al. 2004) to generate the part-of-speech tags.
Details of our direct address match are presented below:

1. For each screen name in the author list, query WordNet for meanings.

2. For each word in a message, do the following:

   2.1 Use the Dice equation to find the most similar screen name that has appeared
       before.

   2.2 If the highest Dice score is greater than a predefined threshold, query
       WordNet for the meanings of the word and do the following:

       2.2.1 If neither the word nor the screen name has meanings, assign direct
             address.

       2.2.2 Else, get the POS tag for the word. If the word is a noun or noun
             phrase, assign direct address.

       2.2.3 Else, do not assign direct address for the word.
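
The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the n-gram size (bigrams), the 0.5 threshold, and the reading of "total unique n-grams" as the sum over both strings are assumptions, and the `has_meaning` and `is_noun` callables are hypothetical stand-ins for the WordNet lookup and the POS tagger.

```python
def char_ngrams(text, n=2):
    """Set of unique character n-grams of a lowercased string."""
    s = text.lower()
    if len(s) < n:
        return {s} if s else set()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def dice(a, b, n=2):
    """Dice score: 2 * shared unique n-grams / total unique n-grams of both strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    total = len(grams_a) + len(grams_b)
    return 2 * len(grams_a & grams_b) / total if total else 0.0


def direct_address(word, prior_authors, has_meaning, is_noun, threshold=0.5):
    """Return the addressed screen name for `word`, or None (steps 2.1 to 2.2.3)."""
    if not prior_authors:
        return None
    # Step 2.1: most similar screen name among authors already seen in the thread.
    best = max(prior_authors, key=lambda name: dice(word, name))
    # Step 2.2: continue only when the best score exceeds the (assumed) threshold.
    if dice(word, best) <= threshold:
        return None
    # Step 2.2.1: neither the word nor the screen name is a dictionary word.
    if not has_meaning(word) and not has_meaning(best):
        return best
    # Step 2.2.2: otherwise the word must be tagged as a noun.
    if is_noun(word):
        return best
    # Step 2.2.3: no direct address for this word.
    return None


# Hypothetical stand-ins for WordNet and the POS tagger.
no_meaning = lambda w: False
always_meaning = lambda w: True
not_noun = lambda w: False

typo_hit = direct_address("Johnn", ["John", "Tom"], no_meaning, not_noun)
common_word = direct_address("hunter", ["Hunter"], always_meaning, not_noun)
```

In the first call the misspelled "Johnn" still resolves to the screen name "John"; in the second, a screen name that is also a common word is rejected because the word is not used as a noun.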



4.3.2 Lexical Relations: The Lexical Match Algorithm 

Lexical relation match assumes an interaction between the two messages that are 
most similar. It calculates the lexical similarities among stopword-removed mes- 
sages when more explicit interactional coherence features such as quotations and 
direct address are not found. The key to lexical relation match is to develop an 
appropriate formula to calculate the similarity score. We propose a Lexical Match 
Algorithm (LMA) for lexical relation match. The LMA is designed to identify lexi- 
cal relation-based interactions between postings while taking into consideration the 
unique characteristics of CMC interaction, such as topic drift/decay and various 
forms of noise (e.g., misspellings, idiosyncrasies, etc.). The algorithm measures the 
similarity between messages based on the content as well as turn proximity and 
levels of inflection and/or idiosyncratic literary variation. LMA integrates the vec- 
tor space model with Dice’s equation and a turn-based proximity scoring 
mechanism. 

Vector space model (VSM) is one of the most popular methods used to identify 
lexical similarities (Salton and McGill 1986). By using word stems, VSM can also 
identify morphological word changes. However, in order to identify typos, misspell- 
ings, abbreviated references, and other forms of creative user behavior, the Dice 
equation (Adamson and Boreham 1974; De Roeck and Al-Fares 2000) is adopted in 
LMA to complement the traditional VSM. 

Additionally, a high degree of topic decay/drift has been found in asynchronous 
CMC (Herring 1999; Smith and Fiore 2001). Nash (2005) also noticed that most 
CMC interactions happen within three turns. Therefore, CMC interactions represent 
a “closeness” characteristic, which means two closer messages are more likely to 
interact than two messages further away. A topic decay factor calculated by the 
distance (number of turns) between two messages is adopted in our LMA formula 
to address this “closeness” characteristic. 

Here is our LMA formula for lexical similarity: 

\[
\sum_{i=0}^{LenX} \sum_{j=0}^{LenY}
\left[ \frac{Tf_{X_i} + Tf_{Y_j}}{Df_{X_i} + Df_{Y_j}} \right]_{\text{if } Dice(X_i, Y_j) > 0.55}
\times (LenX \times LenY)^{-1} \times (Distance(X, Y) + C)^{-1}
\]

X and Y are the two compared messages. LenX and LenY are the number of unique
non-stopword terms in the two messages, Xi refers to the ith non-stopword term in
message X, and Yj the jth non-stopword term in message Y. Tf is the term frequency and
Df is the document frequency. Distance(X, Y) refers to the number of turns or messages
between two compared messages. If there are N messages between the two compared
messages, their distance is N + 1. C is a constant used to control the impact of
message proximity on the overall similarity between two messages.

In the formula, Dice(Xi, Yj) is used to compare two non-stopword terms. If their
similarity is greater than 0.55, which is a predefined experiment-based threshold, a
combined “tf-idf” score is calculated. (LenX × LenY)^{-1} is the length normalization
factor and (Distance(X, Y) + C)^{-1} is the topic decay factor mentioned before. If
the highest score calculated by our formula is greater than 0.002, another threshold
we use, an interaction is identified. Otherwise, residual match is used. The value of
constant C and the two thresholds are developed based on a manual analysis of ten
other threads in the LNSG forum. These ten threads are not included in our
evaluation.
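
The LMA score can be sketched in Python directly from the definitions above. This is an illustrative sketch, not the authors' code: the bigram-based `dice` helper, the function name `lma_score`, and the default `c=1.0` are assumptions (the chapter does not report the value of C), and term and document frequencies are passed in as plain dictionaries.

```python
def bigrams(term):
    """Unique character bigrams of a lowercased term (the term itself if too short)."""
    s = term.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)} or {s}


def dice(a, b):
    """Dice score between two terms based on shared unique bigrams."""
    grams_a, grams_b = bigrams(a), bigrams(b)
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def lma_score(x_terms, y_terms, tf_x, tf_y, df, distance, c=1.0, dice_threshold=0.55):
    """Lexical similarity between messages X and Y.

    x_terms, y_terms: unique non-stopword terms of each message
    tf_x, tf_y:       term -> frequency within the respective message
    df:               term -> document frequency
    distance:         number of turns between X and Y (N + 1)
    c:                proximity constant (assumed value)
    """
    if not x_terms or not y_terms:
        return 0.0
    total = 0.0
    for xi in x_terms:
        for yj in y_terms:
            # Only sufficiently similar term pairs contribute a tf/df score.
            if dice(xi, yj) > dice_threshold:
                total += (tf_x[xi] + tf_y[yj]) / (df[xi] + df[yj])
    length_norm = 1.0 / (len(x_terms) * len(y_terms))
    topic_decay = 1.0 / (distance + c)
    return total * length_norm * topic_decay


# Made-up example: "baner" is a misspelling of "banner" in the later message.
score = lma_score(
    ["banner", "icon"], ["baner", "logo"],
    tf_x={"banner": 2, "icon": 1}, tf_y={"baner": 1, "logo": 1},
    df={"banner": 3, "baner": 1, "icon": 2, "logo": 4},
    distance=1,
)
```

Under the rule stated in the text, the candidate message with the highest score would be linked to the current message only if that score exceeds 0.002; otherwise residual match takes over.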



4.4 HIC Algorithm: Residual Match 

Residual match is used for messages which do not contain obvious clues for auto- 
matic interaction identification. It is utilized to help enhance interaction recall by 
assigning interactions based on common communication patterns. Prior residual 
matches have used variants of the naive linkage method. One such implementation 
assigns each remaining posting (i.e., one with no identified interaction) to the first 
message in the thread (Commer and Peterson 1986). Other versions of naive linkage 
assign each posting to the preceding message. The intuition behind assigning each 
remaining post to the prior one is that messages are likely to interact with predeces- 
sors in close proximity, given the turn-based nature of CMC (Herring 1999). Since 
residual matching techniques use very general assignment rules, they tend to have 
lower precision as compared to other techniques which use system and/or linguistic 
interaction cues. We propose a new rule-based residual match method which con- 
siders the message proximity as well as the conversation structure and context. The 
details for our residual match are provided below: 



X: The residual message of author A
Y: Previous message of author A
Z: Messages of other authors which are posted between Y and X and are replies to
   messages of author A

1. If Y does not exist, X replies to the first message in the discussion.

2. If Y exists and Z exists, X replies to Z.

3. If Y exists and Z does not exist, X replies to what Y replies to.
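
The three rules above can be sketched in Python. This is a minimal sketch under assumptions: the thread is modeled as a list of dictionaries with hypothetical keys `author` and `reply_to` (an index into the list, or None when no interaction has been assigned yet), and when several Z messages exist the most recent one is chosen, a tie-break the text does not specify.

```python
def residual_match(messages, x):
    """Return the index of the message that residual message `x` is assumed to reply to.

    messages: list of dicts with 'author' and 'reply_to' (index or None), in posting order.
    """
    if x == 0:
        return None  # the first message of a thread has nothing to reply to
    author = messages[x]["author"]
    earlier_own = [i for i in range(x) if messages[i]["author"] == author]
    # Rule 1: no previous message Y by this author -> reply to the first message.
    if not earlier_own:
        return 0
    y = earlier_own[-1]
    # Z: messages by other authors between Y and X that reply to this author.
    z = [
        i for i in range(y + 1, x)
        if messages[i]["author"] != author
        and messages[i]["reply_to"] is not None
        and messages[messages[i]["reply_to"]]["author"] == author
    ]
    # Rule 2: answer the feedback (most recent Z is an assumed tie-break).
    if z:
        return z[-1]
    # Rule 3: continue whatever Y was replying to (may be None).
    return messages[y]["reply_to"]


# Made-up thread used only for illustration.
thread = [
    {"author": "A", "reply_to": None},  # 0: thread starter
    {"author": "B", "reply_to": 0},     # 1: B answers A
    {"author": "A", "reply_to": None},  # 2: residual, rule 2 applies
    {"author": "C", "reply_to": None},  # 3: residual, rule 1 applies
    {"author": "B", "reply_to": None},  # 4: residual, rule 3 applies
]
targets = [residual_match(thread, i) for i in (2, 3, 4)]
```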





The first rule is to apply the improved naive linkage method when the residual 
message is the first message the author has posted in the thread. The other two rules 
are generated based on two human communication characteristics, which can also 
be found in CMC. If people give feedback or raise questions to our proposed ideas 
and statements, it is natural for us to comment on the feedback or answer the ques- 
tions, which is characterized by the second rule. On the other hand, even if no feed- 
back is given, people tend to strengthen or make clear their previous statements, 
characterized by the third rule. 



5 Evaluation 

In order to evaluate the effectiveness of our HIC algorithm, an experiment was
conducted. The experiment compared the HIC algorithm against the link-based and
similarity-based methods. The test bed and experimental design are described in
detail below.



5.1 Test Bed 

Our test bed consisted of a large extremist Web forum, the Libertarian National
Socialist Green Party (LNSG) Forum (http://www.nazi.org/community/forum/). Analysis
of such social online communities is important in order to improve our understanding
of these groups and organizations (Burris et al. 2000; Schafer 2002; Chen 2005). For
the forum, several of the longest threads were studied (shown in Table 7.3).

All threads were manually tagged first by a single annotator to identify their inter- 
actional coherence. A sample of 100 messages from the annotator was also tagged 
by a second coder to check the accuracy of the tagging. Both independent annotators 
were graduate students with strong linguistic backgrounds. The annotators deter- 
mined a correct interaction by looking for interaction cues in every message. The 
cues included features found in message headers (e.g., an “RE:” in the subject line), 
quoted content from another message, linguistic cues inherent in the message body 
(e.g., direct address and lexical relations), as well as those based on the thread 
context (i.e., residual rule matching based on previous postings and interaction). 



Table 7.3 Details for datasets in test bed

Forum        Thread no.   Thread subject          No. of messages   No. of users
LNSG forum   4            Idea for banner/icon    148               24
             5            Blue eyes, blond hair   62                22
             6            Greetings               85                14
             7            Race mixing             143               39

Table 7.4 Interaction feature breakdowns across threads

Forum        Thread no.   No. of messages   Quotation (%)   Direct address (%)   Lexical relation (%)   Others (%)
LNSG forum   4            148               16.2            16.2                 41.9                   25.7
             5            62                9.7             9.7                  53.2                   27.4
             6            85                21.2            24.7                 35.3                   18.8
             7            143               33.6            8.4                  33.6                   24.4
             Overall      438               21.9            14.4                 39.5                   24.2


The annotators utilized the guidelines proposed by Nash (2005) for manually
identifying linguistic interaction cues. Figure 7.2 provided examples of how
interactional coherence could be derived using linguistic features. The inter-coder
reliability across the 100 messages had a kappa statistic of 0.88, which is
considered reliable. The tagging results were used as our gold standard. The
interaction feature breakdowns across threads based on the manual tagging are
presented in Table 7.4.



5.2 Comparison of Techniques 

5.2.1 Experiment Setup 

In the first experiment, we compared our HIC algorithm with a link-based method 
that relies on system features, as well as against a similarity-based method, which 
relies on linguistic features. These comparison techniques were incorporated since 
variations of the link-based method and similarity-based method have been adopted 
in previous studies (Spiegel 2001; Soon et al. 2001; Newman 2002; Yee 2002). The 
purpose of this experiment was to study the effectiveness of the combined usage of 
system features and linguistic features, as done in the proposed HIC algorithm, over 
techniques mostly utilizing a single category of features. 

The link-based method uses the quotations in the header information for interac- 
tional coherence identification (Yee 2002). If a quotation exactly matches previous 
messages, the interaction is noted between the two postings. For remaining mes- 
sages, the naive linkage method is used, which assumes that the remaining mes- 
sages are replies to the first message. 

The similarity-based method consists of two parts: simple direct address match and
vector space model match (Bagga and Baldwin 1998). The first part identifies
interactional coherence when a word is an exact match with another author’s screen
name. The second part uses the traditional “tf-idf” score to identify lexical
similarity. A threshold of 0.2, shown to be the best threshold by Bagga and Baldwin
(1998), is used for this traditional VSM match. Precision, recall, and F-measure at
both the forum and thread level were used to evaluate the performance of these
methods.

Table 7.5 Experimental results for experiment 1

Forum        Technique          Precision   Recall   F-measure
LNSG forum   HIC algorithm      0.711       0.711    0.711
             Link-based         0.560       0.551    0.555
             Similarity-based   0.584       0.678    0.625

Table 7.6 p values for pairwise t tests on accuracy for experiment 1

Forum        Techniques                        p values
LNSG forum   HIC vs. link-based                <0.001*
             HIC vs. similarity-based          <0.001*
             Link-based vs. similarity-based   <0.001*

*p values significant at alpha = 0.01



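\[
\text{Precision} = \frac{\text{Number of Correctly Identified Interactions}}{\text{Total Number of Identified Interactions}}
\]

\[
\text{Recall} = \frac{\text{Number of Correctly Identified Interactions}}{\text{Total Number of Interactions}}
\]

\[
F\text{-measure} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}
\]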
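
As a small worked example of these three measures, the sketch below scores a set of predicted reply links against a gold standard. It is illustrative only: representing each interaction as a `(message_id, replied_to_id)` pair and the function name `evaluate` are assumptions, and the sample links are made up rather than taken from the test bed.

```python
def evaluate(predicted, gold):
    """Return (precision, recall, f_measure) for two sets of (message_id, replied_to_id) pairs."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Made-up interactions: 3 of the 4 predicted links appear among the 5 gold links.
predicted_links = {(2, 1), (3, 1), (4, 2), (5, 4)}
gold_links = {(2, 1), (3, 2), (4, 2), (5, 4), (6, 5)}
precision, recall, f_measure = evaluate(predicted_links, gold_links)
```

A method that assigns exactly one reply link to every message in the gold standard identifies as many interactions as the gold standard contains, so its precision and recall coincide.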



5.2.2 Hypotheses 

Given the presence of system and linguistic interaction cues in online discourse, we 
believe that interactional coherence identification techniques incorporating both 
feature types are likely to provide better performance. Therefore, we propose the 
following hypotheses. 

H1a: The HIC algorithm will outperform the link-based method for Web forum
interactional coherence analysis.

H1b: The HIC algorithm will outperform the similarity-based method for Web forum
interactional coherence analysis.

5.2.3 Experimental Results 

Table 7.5 shows the experimental results for all three methods. Our HIC algorithm
had the best performance on the LNSG forum in terms of precision, recall, and
F-measure.



5.2.4 Hypotheses Results 

Table 7.6 shows the p values for the pairwise t tests conducted on the interactional
coherence identification accuracies to measure the statistical significance of the
results. Values marked with an asterisk indicate statistically significant outcomes
in line with our hypotheses. Both hypotheses, H1a and H1b, are supported.

H1a: The HIC algorithm outperformed the link-based method on the LNSG forum
(p<0.01).

H1b: The HIC algorithm outperformed the similarity-based method on the LNSG forum
(p<0.01).

5.2.5 Results and Discussion 

The HIC algorithm performed better than both the link-based and similarity-based 
methods for our test bed. The F-measure was 8-15% higher than the other two tech- 
niques. Such improved performance was consistent across all threads in our test bed. 
The enhanced accuracy of the HIC algorithm was attributable to the incorporation 
of both system and linguistic features and its ability to handle various forms of CMC 
noise. For the LNSG forum, lexical relations were more commonly used as interac- 
tion cues, resulting in the improved performance of the similarity method over the 
link-based method on this forum. The LNSG forum members were less likely to 
utilize system features, which are heavily relied upon by the link-based method. 



6 Conclusions 

In this study, we applied interactional coherence analysis to Web forums. We devel- 
oped a hybrid approach that uses both CMC system features and linguistic features 
for constructing interaction patterns from Web discourse. The results show that our 
approach outperformed traditional link-based and similarity-based methods due to 
the use of a robust set of interaction features. In the future, we will work on analyz- 
ing user roles in Web forums based on interaction networks generated by the HIC 
algorithm. We are also interested in identifying interaction across different forums 
so that we can understand the information dissemination patterns across multiple 
forums, and in exploring the effectiveness of using thread-level interaction networks 
to identify important threads in Web forums. Another attractive direction is to apply 
our techniques to other CMC modes such as blogs and chat room discussions. Blogs 
have very similar system features to Web forums, including headers and quotations. 
Bloggers also share usage idiosyncrasies with Web forum posters, such as typos and 
misspellings. Chat rooms, however, usually do not have system features, and the 
chat postings are often too short to provide useful lexical information. By applying 
our algorithm to these two types of datasets, we may be able to identify the potential 
differences in their interactional coherence. 


Chapter 8

Dark Web Attribute System 



1 Introduction 

The weekly news coverage of excerpts from messages and videos produced and web- 
cast by terrorists/extremists has shown that terrorists and extremists have become 
exploiters of the Internet beyond routine communication operations. The Internet has 
dramatically increased their ability to influence the outside world (Arquilla and 
Ronfeldt, 1993). Several virtues of the Internet, such as ease of access, anonymity of 
posting, huge audience, and lack of regulations, have enabled terrorists to directly speak 
to millions of people - both supporters and adversaries - with little chance of being 
detected. As posited by Jenkins (2004), through operating their own web sites and
online forums, terrorists have effectively created their own “terrorist news network.” 
Terrorist/extremist organizations have generated thousands of web sites that sup- 
port psychological warfare, fundraising, recruitment, coordination, and distribution 
of propaganda materials. From those terrorist/extremist web sites, supporters can 
download multimedia training materials, buy games, T-shirts, and music CDs and 
access forums and chat services such as PalTalk (Elison, 2000; Tekwani, 2002; 
Bowers, 2004; Muriel, 2004; Weimann, 2004). Some web sites such as those 
associated with the jihad terrorist/extremist movement are extremely dynamic in 
that they emerge overnight, frequently modify their contents, and then swiftly “dis- 
appear” by changing their URLs which are later announced via online forums 
(Weimann, 2004). They are often hosted on free web space servers or by unsecured 
and poorly maintained commercial servers. Such web sites are technically supported 
by those who are Internet savvy to provide sophisticated propaganda images and 
videos via proxy servers to mask ownerships (Armstrong and Forde, 2003; El Deeb, 
2004). The level of technical sophistication of the Islamic terrorist/extremist organi- 
zations’ web sites has increased according to Katz, who monitors Islamic funda- 
mentalist Internet activities (Internet Haganah, 2005). The rapid proliferation and 
increased sophistication of web sites and online forums run by terrorist/extremist 
organizations are indications of the growing popularity of the Internet in terrorism 
campaigns. They also indicate that there is a vast pool of sympathizers that such 
organizations have attracted, with some applying their IT expertise as contributions
to the cause (Jesdanun, 2004). 

Although this alternate side of the Internet, referred to as the “Dark Web,” has 
received extensive government and media attention, there is a dearth of empirical 
studies that examine the sophistication of terrorist/extremist organizations’ web 
sites and how they support strategic and tactical information operations. Therefore, 
some basic questions about terrorist/extremist organizations’ Internet usage remain 
unanswered. For example, what are the major Internet technologies that they have 
used on their web sites? How sophisticated and effective are the technologies in 
terms of supporting communications and propaganda activities? 

In this chapter, we explore an integrated approach for collecting and monitoring 
terrorist-created web contents and propose a systematic content analysis approach 
to enable quantitative assessment of the technical sophistication of terrorist/extrem- 
ist organizations’ Internet usage. The rest of this chapter is organized as follows: In 
Sect. 2, we briefly review previous works on terrorists’ use of the Internet. In Sect. 3, 
we present our research questions and the proposed methodologies to study those 
questions. In Sect. 4, we describe the findings obtained from a case study of the 
analysis of technical sophistication, content richness, and web interactivity features 
of major Middle Eastern terrorist/extremist organizations’ web sites and a bench- 
mark comparison of Middle Eastern terrorist/extremist web sites and web sites from 
the US government. In the last section, we provide conclusions and discuss the 
future directions of this research. 



2 Literature Review 

2.1 Terrorism and the Internet 

Previous research showed that terrorists/extremists mainly utilize the Internet to 
enhance their information operations surrounding propaganda, communication, and 
psychological warfare (Thomas, 2003; Denning, 2004; Weimann, 2004). To achieve 
their goals, terrorists/extremists often need to maintain a certain level of publicity 
for their causes and activities to attract more supporters. Prior to the Internet era, 
terrorists/extremists maintained publicity mainly by catching the attention of tradi- 
tional media such as television, radio, or print media. This was difficult for them 
because terrorists/extremists often could not meet the editorial selection criteria of 
those public media (Weimann, 2004). With the Internet, terrorists/extremists can 
bypass the requirements of traditional media and directly reach hundreds of mil- 
lions of people globally, 24/7. 

Terrorist/extremist groups have sought to replicate or supplement the communi- 
cation, fundraising, propaganda, recruitment, and training functions on the Internet 
by building web sites with massive and dynamic online libraries of speeches, train- 
ing manuals, and multimedia resources that are hyperlinked to other sites that share 
similar beliefs (Coll and Glasser, 2005; Weimann, 2004). The web sites are designed
to communicate with diverse global audiences of members, sympathizers, media, 
enemies, and the public (Weimann, 2004). Table 8.1 summarizes terrorist/extremist 
groups’ objectives and tasks that are supported by web sites. 



2.2 Existing Dark Web Studies 

In recent years, there have been studies of how terrorists/extremists use the web to 
facilitate their activities (Zhou et al., 2005; Chen et al., 2004; ISTS, 2004; Thomas, 2003; 
Tsfati and Weimann, 2002; Weimann, 2004). For example, researchers at the Institute 
for Security Technology Studies (ISTS) have analyzed dozens of terrorist/extremist 
organizations’ web sites and identified five categories of terrorists’ use of the web: 
propaganda, recruitment and training, fundraising, communications, and targeting. 
These usage categories are supported by other studies such as those by Thomas (2003), 
Katz at SITE Institute (2004), and Weimann (2004). 

Since the late 1990s, several organizations, such as SITE Institute, the Anti- 
Terrorism Coalition (ATC), and the Middle East Media Research Institute (MEMRI), 
started to monitor contents from selected terrorist/extremist web sites for research 
and intelligence purposes. Tsfati and Weimann (2002) studied the content types and 
target audiences of terrorist/extremist organizations’ web sites by analyzing the con- 
tent of 29 Middle Eastern web sites. Table 8.2 lists some of the organizations that 
capture and analyze terrorists/extremists’ web sites (and the collection start dates) 
grouped into three functional categories: archive, research center, and vigilante 
community. 

Except for the Artificial Intelligence (AI) Lab, none of the enumerated organizations
seem to use automated methodologies for both collection building and analysis of the
web sites. Due to the enormous size and the dynamic nature of the web,
the manual collection and analysis approaches have limited the comprehensiveness 
of their analyses. Furthermore, none of the studies have provided empirical evi- 
dence of the levels of technical sophistication or compared terrorist/extremist orga- 
nizations’ cyber capabilities with those of mainstream organizations. Since technical 
knowledge required to maintain web sites provides an indication of terrorist/extrem- 
ist organizations’ technology adoption strategies (Jackson, 2001), we believe it is 
important to analyze the technologies required to maintain terrorist/extremists’ web 
sites from the perspectives of technical sophistication, content richness, and web 
interactivity. 



Table 8.1 How web sites support objectives of terrorist/extremist groups

Objective: Enhance communication (Becker, 2005; Weimann, 2004)
  Tasks supported by web sites:
    • Composing, sending, and receiving messages
    • Searching for messages, information, and people
    • One-to-one and one-to-many communications
    • Maintaining anonymity
  Web features (Preece, 2000):
    Synchronous (chat, video conferencing, MUDs, and MOOs) and asynchronous
    (e-mail, bulletin board, forum, and Usenet newsgroup)
    GUI
    Help function
    Feedback form
    Login
    E-mail address for webmaster and organization contact

Objective: Increase fundraising (Weimann, 2004)
  Tasks supported by web sites:
    • Publicizing need for funds
    • Providing options for collecting funds
  Web features (Preece, 2000):
    Payment instruction and facility
    E-commerce application
    Hyperlinks to other resources

Objective: Diffuse propaganda (Weimann, 2004)
  Tasks supported by web sites:
    • Posting resources in multiple languages
    • Providing links to forums, videos, and other groups’ web sites
    • Using web sites as an online clearinghouse for statements from leaders
  Web features (Preece, 2000):
    Content management
    Hyperlinks
    Directory for documents
    Navigation support
    Search, browsable index
    Free web site hosting
    Accessible

Objective: Increase publicity (Coll and Glasser, 2005; Jenkins, 2004)
  Tasks supported by web sites:
    • Advertising groups’ events, martyrs, history, and ideologies
    • Providing groups’ interpretation of the news
  Web features (Preece, 2000):
    Downloadable files
    Animated and flashy banner, logo, and slogan
    Clickable maps
    Information resources (e.g., international news)

Objective: Overcome obstacles from law enforcement and military (Coll and Glasser,
2005; Kelley, 2001)
  Tasks supported by web sites:
    • Send encrypted messages via e-mail, forums, or post on web sites
    • Move web sites to different servers so that they are protected
  Web features (Preece, 2000):
    Anonymous e-mail accounts
    Password-protected or encrypted services
    Downloadable encryption software
    E-mail security
    Steganography

Objective: Provide recruitment and training (Weimann, 2004)
  Tasks supported by web sites:
    • Hosting martyrs’ stories, speeches, and multimedia that are used for recruitment
    • Using flashy logos, banners, and cartoons to appeal to sympathizers with
      specialized skills and similar views
    • Build massive and dynamic online libraries of training resources
  Web features (Preece, 2000):
    Interactive services (e.g., games, cartoons, and maps)
    Online registration process
    Directory
    Multimedia (e.g., videos, audios, and images)
    FAQs, alerts
    Virtual community

Table 8.2 Organizations that capture and analyze terrorists’ web sites

Archives

1. Internet Archive (IA)
   Description: 1996 - Collect open access HTML pages (every 2 months)
   Access: Via http://www.archive.org

Research centers

2. Anti-Terrorism Coalition (ATC)
   Description: 2003 - Jihad watch. Has 448 terrorist web sites and forums
   Access: Via http://www.atcoalition.net

3. Artificial Intelligence (AI) Lab, University of Arizona
   Description: 2003 - Spidering (every 2 months) to collect terrorist web sites. Has
   thousands of web sites: US domestic, Latin America, and Middle Eastern web sites
   Access: Via test bed portal called Dark Web Portal (ai.arizona.edu)

4. MEMRI
   Description: 2003 - Jihad and terrorism studies project
   Access: Access reports via http://www.memri.org

5. SITE Institute
   Description: 2003 - Capture web sites every 24 h. Extensive collection of
   thousands of files
   Access: Access reports and fee-based intelligence services http://siteinstitute.org

6. Weimann (University of Haifa, Israel)
   Description: 1998 - Capture web sites daily, extensive collection of thousands
   of files
   Access: Closed collection

Vigilante community

7. Internet Haganah
   Description: 2001 - Confronting the global Jihad project. Has hundreds of links
   to web sites
   Access: Provides snapshots of terrorist web sites http://haganah.us



2.3 Dark Web Collection Building

The first step toward studying the terrorist/extremist web presence is to capture
terrorist web sites and store them in a repository for further analysis. Web
collection building is the process of gathering and organizing unstructured
information from pages and data on the web. Previous studies have suggested three
types of approaches to collecting web contents in specific domains: manual approach,
automatic approach, and semiautomatic approach.

In order to build the September 11 and Election 2002 Web Archives (Schneider
et al., 2003), the Library of Congress collected seed URLs for a given theme. The 
seeds and their close neighbors (distance 1) are then downloaded. The limitation of 
such a manual approach is that it is time consuming and inefficient. 

Albertsen (2003) used an automatic approach in the “Paradigma” project. The 
goal of Paradigma is to archive Norwegian legal deposit documents on the web. 
It employed a focused web crawler (Kleinberg et al., 1998), an automatic program 
that discovers and downloads web sites in particular domains by following web 
links found in the HTML pages of a starting set of web pages. Metadata was then 
extracted and used to rank the web sites in terms of relevance. The automatic 
approach is more efficient than the manual approach; however, due to the limita- 
tions of current focused crawling techniques, automatic approaches often introduce 
noise (off-topic web pages) into the collection. 
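
The link-following idea behind a focused crawler can be sketched in Python over an in-memory link graph. This is a generic illustration, not the crawler used in any of the cited projects: `fetch_links` and `is_relevant` are hypothetical callables standing in for page download and relevance ranking, and the toy `graph` replaces real web pages.

```python
from collections import deque


def focused_crawl(seeds, fetch_links, is_relevant, max_depth=1):
    """Breadth-first collection from seed URLs, keeping only pages judged relevant."""
    seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep their order
    seen = set(seeds)
    queue = deque((url, 0) for url in seeds)
    collected = []
    while queue:
        url, depth = queue.popleft()
        if not is_relevant(url):
            continue  # off-topic page: neither stored nor expanded
        collected.append(url)
        if depth < max_depth:
            for link in fetch_links(url):
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return collected


# Toy link graph standing in for real pages (made-up identifiers).
graph = {"seed": ["a", "b"], "a": ["c"], "b": ["d"], "c": [], "d": []}
fetch = lambda url: graph.get(url, [])
on_topic = lambda url: url != "b"  # pretend page "b" is off-topic

one_hop = focused_crawl(["seed"], fetch, on_topic, max_depth=1)
two_hops = focused_crawl(["seed"], fetch, on_topic, max_depth=2)
```

With `max_depth=1` the sketch reproduces the "seeds and their close neighbors (distance 1)" policy described for the manual approach; the quality of the collection then depends entirely on how well `is_relevant` filters off-topic pages.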

The “Political Communications Web Archiving” group employed a semiauto- 
matic approach to collecting domain-specific web sites (Reilly et al., 2003). Domain 
experts provided seed URLs as well as typologies for constructing metadata that can 
be used in the crawling process. Their project’s goal is to develop a methodology for 
constructing an archive of broad-spectrum political communications over the web.
We believe that the semiautomatic approach is most suitable for collecting
terrorist/extremist web sites because it combines the high accuracy of the manual
approach with the high efficiency of the automatic approach.



2.4 Dark Web Content Analysis 

In order to reach an understanding of the various facets of terrorist/extremist web 
usage and communications, a systematic analysis of the web sites’ content is 
required. Researchers in the terrorism domain have used observation and content 
analysis to analyze web site data. In Bunt’s (2003) overview of jihadi movements’ 
presence on the web, he described the reaction of the global Muslim community to 
the content of jihadi terrorist web sites. His assessment of the influence such content 
had on Muslims and Westerners was based on a qualitative analysis of message 
contents extracted from Taliban and al-Qaeda web sites. Tsfati and Weimann (2002) 
conducted a content analysis of the characteristics of terrorist groups’ communica- 
tions. They said that the small size of their collection and the descriptive nature of 
their research questions made a quantitative analysis infeasible. 

Demchak et al. (2000) provided a well-defined methodology for analyzing com- 
municative content in government web sites. Their work focused on measuring 
“openness” of government web sites. To achieve this goal, they developed a Web 
Site Attribute System (WAES) tool that is basically composed of a set of high-level 
attributes such as transparency and interactivity. Each high-level attribute is associ- 
ated with a second layer of attributes at a more refined level of granularity. For 
example, the increase of “operational information” and “responses” on a given web 
page can induce an increase in the openness level of a government web site. This 
WAES system is an example of a well-structured and systematic content analysis 
methodology. 

Demchak et al.’s work provides guidance for this chapter. However, the “open- 
ness” attributes used in their work were designed specifically for e-government 
studies. We surveyed research in e-commerce, e-government, and e-education 
domains and identified several sets of attributes that could be used to study the tech- 
nical advancement and effectiveness of terrorists/extremists’ use of the Internet. 

Palmer and David’s (1998) study identified a set of 15 attributes (called “techni- 
cal characteristics” in the original work) to evaluate two aspects of e-commerce web 
sites: technical sophistication and media richness. More specifically, the technical 
sophistication attributes measure the level of advancement of the techniques used in 
the design of web sites, e.g., “use of HTML frames,” “use of Java scripts,” etc. The 
media richness attributes measure how well the web sites use multimedia to deliver 
information to their users, e.g., “hyperlinks,” “images,” “video/audio files,” etc. 

Another set of attributes called web interactivity has been widely adopted by 
researchers in e-government and e-education domains to evaluate how well web sites 
facilitate the communication among web site owners and users. Two organizations,
the United Nations Online Network in Public Administration and Finance (UNPAN;
www.unpan.org) and the European Commission’s IST program (www.cordis.lu/ist/),
have conducted large-scale studies to evaluate the interactivity of government
web sites of major countries in the world. The web interactivity attributes can be
summarized into three categories: one-to-one-level interactivity, community-level
interactivity, and transaction-level interactivity.

The one-to-one-level interactivity attributes measure how well the web sites sup- 
port individual users to give feedback to the web site owners (e.g., provide e-mail 
contact, provide guest book functions, etc.). The community-level interactivity attri- 
butes measure how well the web sites support the two-way interaction between site 
owners and multiple users (e.g., use of forums, online chat rooms, etc.). The transaction- 
level interactivity measures how well users are allowed to finish tasks electronically 
on the web sites (e.g., online purchasing, online donation, etc.). 

Chou’s (2003) study proposed a detailed four-level framework to analyze e-education 
web sites' level of advancement and effectiveness. Attributes in the first level (called 
learner-interface interaction) of Chou’s framework are very similar to the technical 
sophistication attributes used in Palmer and David’s (1998) study. Attributes in the 
other three levels (learner-content interaction, learner-instructor interaction, and 
learner-learner interaction) of Chou’s framework are similar to the three-level web 
interactivity attributes used in the e-government evaluation projects as mentioned 
above. 

To date, no study has employed the technical sophistication, media richness, and 
web interactivity attributes as well as the WAES framework in the terrorism domain. 
We believe that these web content analysis metrics can be applied in terrorist/ 
extremist web site analysis to deepen our understanding of the terrorists’ tactical use 
of the web. 



3 Proposed Methodology: Dark Web Collection and Analysis 

The research questions postulated in this chapter are: 

1. What design features and attributes are necessary to build a highly relevant and
comprehensive Dark Web collection for intelligence and analysis purposes? 

2. For terrorist/extremist web sites, what are the levels of technical sophistication 
in their system design? 

3. For terrorist/extremist web sites, what are the levels of richness in their online 
content? 

4. For terrorist/extremist web sites, what are the levels of web interactivity to sup- 
port individual, community, and transaction interactions? 

To study the research questions, we propose a Dark Web analysis tool which 
contains several components: a systematic procedure for collecting and monitoring 
Dark Web contents and a Dark Web Attribute System to enable quantitative analysis 
of Dark Web content (see Fig. 8.1). 

Fig. 8.1 The Dark Web collection-building and content analysis framework



3.1 Dark Web Collection Building 

The first step toward studying terrorists’ tactical use of the web is to build a high- 
quality Dark Web collection. To ensure the quality of our collection, based on our 
review of web collection-building methodologies, we propose to use a semiautomated 
approach to collecting Dark Web contents (Reid et al., 2004). Our collection-building
approach contains the following steps (see Fig. 8.2 for graphical depiction): 

1. Identify terrorist/extremist groups: Defining terrorism is complicated by the fact
that people almost never define themselves as terrorists, and the use of the label 
by others often has political overtones. We start the collection-building process 
by identifying the groups that are considered by authoritative sources as terrorist/ 
extremist groups. The sources include government agency reports (e.g., US State 
Department reports, FBI reports, government reports from United Kingdom, 
Australia, Japan, and P. R. China, etc.), authoritative organization reports (e.g., 
Counter-Terrorism Committee of the UN Security Council, US Committee for 
A Free Lebanon, etc.), and studies published by terrorism research centers such 
as the Anti-Terrorism Coalition (ATC), the Middle East Media Research Institute 
(MEMRI), Dartmouth College, etc. Information such as terrorist group names, 
leaders’ names, and terrorist jargon is identified from the sources to create a ter- 
rorism keyword lexicon for use in the next step. 

Fig. 8.2 The Dark Web collection building approach

2. Identify terrorist/extremist group URLs: We manually identify a set of seed
terrorist group URLs from two sources. First, terrorist group URLs can be directly
identified from the authoritative sources and literature used in the first step.
Second, terrorist group URLs can be identified by using the terrorism keyword
lexicon to query major search engines on the web. The identified set of terrorist
group URLs will serve as the seed URLs for the next step.

3. Expand terrorist/extremist URL sets through link and forum analysis: After
identifying the seed URLs, out-links and in-links of the seed URLs are automatically
extracted using link analysis programs. The out-links are extracted from the
HTML contents of “favorite link” pages under the seed web sites. The in-links
are extracted from Google’s in-link search service through the Google API. Automatic
out-link and in-link expansion is an effective way to expand the scope of our
collection. We also have language experts who browse the contents of
terrorist-supporting forums and extract the terrorist/extremist URLs posted by
terrorist supporters. Because bogus or unrelated web sites can make their way into
our collection through the expansion, we have developed a robust filtering process
based on evidence and clues from the web sites. Aside from sites which explicitly
identify themselves as the official sites of a terrorist organization or one of its
members, a web site that contains even minor praise of or adopts ideologies
espoused by a terrorist group is included in our collection.

4. Download terrorist/extremist web site contents: Once the terrorist/extremist web 
sites are identified, a program is used to automatically download all their con- 
tents. Unlike the tools used in previous studies, our program was designed to 
download not only the textual files (e.g., HTML, TXT, PDF, etc.) but also multi- 
media files (e.g., images, video, audio, etc.) and dynamically generated web files 
(e.g., PHP, ASP, JSP, etc.). Moreover, because terrorist organizations set up 
forums within their web sites whose contents are of special value to research 
communities, our program also can automatically log into the forums and down- 
load the dynamic forum contents. The automatic downloading method allows us 
to effectively build Dark Web collections with millions of documents. This 
greatly increases the comprehensiveness of our Dark Web study. 

To keep the Dark Web collection comprehensive and up-to-date, steps 2 to 4 are
periodically repeated. Collections built using such a recursive procedure can also
provide information about the evolution and diffusion of the Dark Web.
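
The out-link part of the expansion step can be illustrated with a minimal sketch. It is not the link analysis program used in this study: the class and function names, the placeholder hosts, and the rule of keeping only hosts outside the seed set are all assumptions made for illustration, and the in-link lookup and manual filtering are omitted.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect the href targets of all <a> tags in one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the URL of the page.
                self.links.append(urljoin(self.base_url, href))


def expand_out_links(seed_pages):
    """Return hosts linked from the seed pages that are not seeds themselves.

    seed_pages maps a seed URL to the HTML of its "favorite link" page.
    This is a one-level-deep expansion; candidates still need filtering.
    """
    seed_hosts = {urlparse(url).netloc for url in seed_pages}
    candidates = set()
    for url, html in seed_pages.items():
        parser = LinkExtractor(url)
        parser.feed(html)
        for link in parser.links:
            host = urlparse(link).netloc
            if host and host not in seed_hosts:
                candidates.add(host)
    return sorted(candidates)


# Hypothetical page content; all hosts below are placeholders.
pages = {
    "http://seed.example.org/links.html": (
        '<a href="http://other.example.net/">A</a>'
        '<a href="/about.html">B</a>'
        '<a href="http://third.example.com/forum/">C</a>'
    )
}
candidates = expand_out_links(pages)
print(candidates)
```

In this sketch the relative link resolves to the seed host and is therefore dropped, so only the two external hosts remain as candidates for the filtering process.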



3.2 The Dark Web Attribute System (DWAS) 

Instead of using observation-based qualitative analysis approaches (Thomas, 2003),
we propose a systematic approach to enable the quantitative study of
terrorist/extremist groups’ use of the web. The proposed Dark Web Attribute System
is similar to the WAES framework in Demchak et al.’s study (2000). However, instead
of the openness attributes used in WAES, our framework focuses on the attributes
that could help us better understand the level of advancement and effectiveness of
terrorists’ web usage, namely, technical sophistication attributes, content richness
attributes (an extension of the traditional media richness attributes), and web
interactivity attributes. Based on previous literature in e-commerce (Palmer and
David, 1998), e-government (Demchak et al. 2000), and e-education domains (Chou,
2003), we selected 13 technical sophistication attributes, 5 content richness
attributes, and 11 web interactivity attributes for our DWAS framework. A list of
these attributes is summarized in Tables 8.3a to 8.3c.

Table 8.3(a) Technical sophistication attributes

Category                   TS attributes                           Weights
-------------------------  --------------------------------------  -------
Basic HTML techniques      Use of lists                            1
                           Use of tables                           2
                           Use of frames                           2
                           Use of forms                            1.5
Embedded multimedia        Use of background image                 1
                           Use of background music                 2
                           Use of stream audio/video               3.5
Advanced HTML              Use of DHTML/SHTML                      2.5
                           Use of predefined script functions      2
                           Use of self-defined script functions    4.5
Dynamic web programming    Use of CGI                              2.5
                           Use of PHP                              4.5
                           Use of JSP/ASP                          5.5

Table 8.3(b) Content richness attributes

CR attributes              Scores
-------------------------  --------------------------
Hyperlink                  # Hyperlinks
File/software download     # Downloadable documents
Image                      # Images
Video/audio file           # Video/audio files

Table 8.3(c) Web interactivity attributes

Category                           WI attributes              Weights
---------------------------------  -------------------------  -------
One-to-one interactivity           E-mail feedback            1.75
                                   E-mail list                2.25
                                   Contact address            1.25
                                   Feedback form              2.75
                                   Guest book                 1.5
Community-level interactivity      Private message            4.25
                                   Online forum               4.25
                                   Chat room                  4.75
Transaction-level interactivity    Online shop                4
                                   Online payment             4
                                   Online application form    4






1. Technical sophistication (TS) attributes: The technical sophistication attributes can
be grouped into four categories as shown in Table 8.3a. The first category of four 
attributes, called the basic HTML technique attributes, measures how well the 
basic HTML layout techniques (i.e., lists, tables, frames, and forms) are applied in 
web sites to organize web contents. The second category, called the embedded 
media attributes, measures how well the web sites deliver their information to the 
user in multimedia formats such as images, animations, and audio/video clips. The 
third category of three attributes, called the advanced HTML attributes, measures 
how well advanced HTML techniques, such as DHTML and SHTML, and pre- 
defined and self-defined script functions (e.g., JavaScript, VBScript, etc.) are 
applied to implement security and dynamic functionalities. The last category, called 
the dynamic web programming attributes, measures how well dynamic web pro- 
gramming languages such as PHP, ASP, and JSP are utilized to implement dynamic 
interaction functionalities such as user login, online request or application, and 
online transaction processing. The four technical sophistication attribute categories
and their associated subattributes are present in most of the Dark Web sites we collected.

The presence of different attributes indicates different levels of technical 
sophistication. For example, a web site which uses JSP techniques should be 
considered more technically sophisticated than a site which only uses static 
HTML. Different weights should be assigned to the attributes to reflect the dif- 
ferences (Chou, 2003). We determined the weights based on web experts’ opin- 
ions collected through an e-mail survey. Surveys were sent to webmasters and 
network administrators of several web sites belonging to the University of 
Arizona, and they were encouraged to forward the survey to their webmaster 
colleagues. In the survey, we asked the experts to give each of our attributes a 
weight of 1-10 (1 is the least advanced/sophisticated). Six experts sent their 
responses back to us. For each attribute, the average weight assigned by the 
experts was used in the final framework. Among the six experts, two are
webmasters of academic web sites, two are webmasters of commercial web sites, one
is a web developer in a commercial company, and the last one is a professor 
teaching web development courses in a university. On average, they have 7 years 
of professional experience in web technology. To ensure the reliability of the 
weights, we conducted a reliability test on the experts’ answers. The reliability 
score (Cronbach’s alpha) calculated for the experts’ answers was 0.89, which is
well above the 0.70 required for acceptable scale reliability (Nunnally, 1978).
The TS attributes and their weights are summarized in Table 8.3a. 

2. Content richness (CR) attributes: In traditional media richness studies, researchers
only focused on the variety of media used to deliver information (Trevino
et al., 1987; Palmer and Griffith, 1998). However, to have a deep understanding 
of the richness of Dark Web contents, we would like to measure not only the 
variety of the media but also the amount of information delivered by each type of 
media. In this chapter, we expand the media richness concept by taking the vol- 
ume of information into consideration. More specifically, as shown in Table 8.3b, 
we calculated the average number of four types of web elements as the indication 
of Dark Web content richness: hyperlinks, downloadable documents, images, 
and video/audio files. 

3. Web interactivity (WI) attributes: For the web interactivity attributes (see
Table 8.3c), we followed the standard built by the UNPAN and the European
Commission’s IST program as well as Chou’s (2003) work to group the attributes
into three levels: the one-to-one-level interactivity, the community-level
interactivity, and the transaction-level interactivity. The one-to-one-level interac- 
tivity contains five attributes (i.e., e-mail feedback, e-mail list, contact address, 
feedback form, and guest book) that provide basic one-to-one communication 
channels for Dark Web users to contact the terrorist web site owners (see 
Table 8.3c). The community-level interactivity contains three attributes (i.e., pri- 
vate message, online forum, and chat room) that allow Dark Web site owners and 
users to engage in synchronized many-to-many communications with each other. 
The transaction-level interactivity contains three attributes (i.e., online shop, 
online payment, and online application form) that allow Dark Web users to com- 
plete tasks such as donating to terrorist/extremist groups, applying for group 
membership, etc. The presence of these attributes in the Dark Web sites indicates 
how well terrorists/extremists utilize Internet technology to facilitate their com- 
munication with their supporters. 

Similar to the TS attributes, different weights should be assigned to the WI attri- 
butes to indicate their different levels of support on communications. We asked web 
experts to assign weights of 1 to 10 to the WI attributes in the same e-mail survey 
where the TS attributes’ weights were determined. The WI attributes and their 
weights are summarized in Table 8.3c. 
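
The weighting procedure (averaging the experts’ ratings and checking their reliability with Cronbach’s alpha) can be sketched as follows. The ratings below are hypothetical and use three experts rather than six, and the choice to treat each expert as one “item” and each attribute as one case is an assumption; the text does not state how the reliability test was set up.

```python
from statistics import mean, variance


def average_weights(ratings):
    """ratings maps an attribute to its list of expert scores (1 to 10)."""
    return {attr: mean(scores) for attr, scores in ratings.items()}


def cronbach_alpha(ratings):
    """Cronbach's alpha, assuming each expert is an item and each attribute a case.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(case totals))
    """
    rows = list(ratings.values())      # one row of scores per attribute
    k = len(rows[0])                   # number of experts
    per_expert = list(zip(*rows))      # one column of scores per expert
    item_var = sum(variance(col) for col in per_expert)
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - item_var / total_var)


# Hypothetical ratings from three experts (not the survey data).
ratings = {
    "Use of lists": [1, 1, 2],
    "Use of tables": [2, 2, 2],
    "Use of PHP": [4, 5, 5],
    "Use of JSP/ASP": [5, 6, 6],
}
weights = average_weights(ratings)
alpha = cronbach_alpha(ratings)
print(weights)
print(round(alpha, 3))
```

A value of alpha close to 1 indicates that the experts rank the attributes consistently, which is the property the 0.70 threshold is meant to guard.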

We developed strategies to efficiently and accurately identify the presence of the 
DWAS attributes from Dark Web sites. The TS and CR attributes are marked by 
HTML tags in page contents or file extension names in the page URL strings. For 
example, an HTML tag “<img>” indicates that an image is inserted into the page
content. A URL string ending with “.jsp” indicates that the page utilizes JSP tech- 
nology. We developed programs to automatically analyze Dark Web page contents 
and URL strings to extract the presence of the TS and CR attributes. Since there are 
no clear indications or rules that a program could follow to identify WI attributes 
from Dark Web contents with a high degree of accuracy, we developed a set of cod- 
ing schemes to allow human coders to identify their presence in Dark Web sites. 
Technical sophistication, content richness, and web interactivity scores are calcu- 
lated for each web site based on the presence of the attributes to indicate how 
advanced and effective the site is in terms of supporting terrorist/extremist groups’ 
communications and interactions. 
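
The tag-based and extension-based detection described above can be sketched for a handful of attributes. This is only an illustration: the regular expressions, the dictionary names, and the placeholder URLs are assumptions, and the scoring rule used here (summing the Table 8.3a weights of the attributes present on a site) is a guess, since the exact formula behind the site scores is not spelled out in the text.

```python
import re
from urllib.parse import urlparse

# Weights copied from Table 8.3a for the attributes sketched here.
TS_WEIGHTS = {
    "Use of lists": 1, "Use of tables": 2, "Use of frames": 2,
    "Use of forms": 1.5, "Use of PHP": 4.5, "Use of JSP/ASP": 5.5,
}

# Hypothetical detection rules: an HTML tag pattern or a URL file extension.
TAG_RULES = {
    "Use of lists": r"<(ul|ol)\b",
    "Use of tables": r"<table\b",
    "Use of frames": r"<(frameset|frame)\b",
    "Use of forms": r"<form\b",
}
EXT_RULES = {
    "Use of PHP": (".php",),
    "Use of JSP/ASP": (".jsp", ".asp"),
}
CR_PATTERNS = {"hyperlinks": r"<a\b[^>]*\bhref=", "images": r"<img\b"}


def ts_attributes(pages):
    """pages is a list of (url, html) pairs for one site."""
    found = set()
    for url, html in pages:
        path = urlparse(url).path.lower()
        for attr, exts in EXT_RULES.items():
            if path.endswith(exts):
                found.add(attr)
        for attr, pattern in TAG_RULES.items():
            if re.search(pattern, html, re.IGNORECASE):
                found.add(attr)
    return found


def ts_score(found):
    """Assumed scoring rule: sum the weights of the attributes present."""
    return sum(TS_WEIGHTS[attr] for attr in found)


def cr_counts(pages):
    """Count hyperlinks and images across all pages of one site."""
    return {
        name: sum(len(re.findall(p, html, re.IGNORECASE)) for _, html in pages)
        for name, p in CR_PATTERNS.items()
    }


# Hypothetical two-page site.
site = [
    ("http://site.example.org/index.jsp",
     '<table><tr><td><a href="a.html">x</a></td></tr></table><img src="p.gif">'),
    ("http://site.example.org/list.html",
     '<ul><li><a href="b.html">y</a></li></ul>'),
]
found = ts_attributes(site)
score = ts_score(found)
counts = cr_counts(site)
print(sorted(found), score, counts)
```

Regular expressions are used here only to keep the sketch short; a production extractor would more likely parse the HTML properly and would also need rules for the multimedia and script attributes.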



4 Case Study: Understanding Middle Eastern 
Terrorist Groups 

To test our proposed approach, we conducted a case study to collect and analyze the 
web presence of major Middle Eastern terrorist groups. We also conducted a bench- 
mark comparison between the terrorist/extremist web sites and US federal and 
state government web sites to evaluate the terrorist/extremist organizations’ online
capabilities. The terrorist/extremist groups we studied mainly include Islamic 
terrorist groups rooted in Middle Eastern countries, for example, al-Qaeda, 
Palestinian Islamic Jihad, Hamas, etc. These terrorist/extremist groups are the focus 
of most current counterterrorism studies. We chose US government web sites as 
benchmarks because government web sites and terrorist/extremist web sites have 
a common overall objective: to inform the public about their goals, programs, and
strategies. To achieve this objective, similar web features must be implemented in
both government and terrorist/extremist web sites. Furthermore, the US government
was ranked top in the world by the CyPRG group (http://www.cyprg.arizona.edu/)
in terms of web technical sophistication and interactivity. With the US government
web sites as high-standard benchmarks, we can better understand the
terrorist/extremist web sites’ levels of technical advancement and effectiveness.



4.1 Building Dark Web Research Test Bed 

Following the collection-building procedure discussed in Sect. 3.1, we created a 
Middle Eastern terrorist/extremist web site collection and a US government web 
site collection as the test beds for this study. 

The Middle Eastern terrorist/extremist web collection was created in June of 
2004. We identified 36 Middle Eastern terrorist/extremist groups from authoritative 
sources mentioned in Sect. 3.1. Based on the information of these terrorist/extremist 
groups, we constructed a lexicon of Middle Eastern terrorism keywords with the 
help of Arabic language experts. Examples of relevant keywords (shown here by their
English glosses) include terrorist leaders’ names such as “Sheikh Mujahid bin Laden,”
terrorist groups’ names such as “Khalq Iran,” and special words used by
terrorists/extremists such as “Crusader’s War” and “Infidels.”
This lexicon was used to query major search engines for identification
and retrieval of terrorist/extremist groups’ URLs. The URLs identified from the 
search engines, together with the terrorist/extremist URLs listed in the terrorism 
literature and reports, served as seed URLs for the out-link and in-link expansion 
process. We performed a one-level-deep in-link expansion using Google’s in-link 
search tool and a one-level-deep out-link expansion. After carefully filtering the 
expansion results, we obtained the URLs of 86 Middle Eastern terrorist/extremist 
web sites. Using SpidersRUs, a digital library building toolkit developed by our 
group, we collected about 222,000 multimedia web documents from the identified 
terrorist/extremist web sites. 

Table 8.4 summarizes the detailed file-type breakdown of the terrorist/extremist
collection; 179,223 out of the total 222,687 documents in the terrorist/extremist
collection are indexable files. These are textual files such as HTML files, plain
text files, PDF/Word documents, and dynamic files generated by web applications
(e.g., ASP, JSP, etc.). Interestingly, the majority of indexable files (130,972 files
out of 179,223 total files) in the terrorist/extremist collection are dynamic files.
We conducted a preliminary analysis on the contents of these dynamic files and found
that most dynamic files were forum postings. This indicates that online forums play
an important role in terrorists/extremists’ web usage. Other than indexable files,
multimedia files also have a significant presence in the terrorist/extremist
collection. While the number of multimedia files is not as large as that of indexable
files, multimedia files are the largest category in the collection in terms of
volume. This indicates heavy use of multimedia technologies in terrorist/extremist
web sites. The last two categories, archive files (1,281 files) and nonstandard
files (7,019 files), made up less than 5% of the collection. Archive files are
compressed file packages such as .zip files and .rar files. They could be password
protected. Nonstandard files are files that cannot be recognized by the Windows
operating system. These files may be of special interest to terrorism researchers
and experts because they could be encrypted information created by
terrorists/extremists. Further analysis is needed to study the contents of these two
types of files.

Table 8.4 Middle Eastern terrorist/extremist web collection file types

Terrorist/extremist collection    # Files    Volume (bytes)
------------------------------    -------    --------------
Grand total                       222,687    12,362,050,865
Indexable files total             179,223     4,854,971,043
  HTML files                       44,334     1,137,725,685
  Word files                          278        16,371,586
  PDF files                         3,145       542,061,545
  Dynamic files                   130,972     3,106,537,495
  Text files                          390        45,982,886
  PowerPoint files                      6         6,087,168
  XML files                            98           204,678
Multimedia files total             35,164     5,915,442,276
  Image files                      31,691       525,986,847
  Audio files                       2,554     3,750,390,404
  Video files                         919     1,230,046,468
Archive files                       1,281       483,138,149
Nonstandard files                   7,019     1,108,499,397

The benchmark US government web collection was built in July of 2004. All 92 
federal and state government URLs under Yahoo!’s “Government” category were
selected as seed URLs. Around 277,000 web documents were automatically col- 
lected from these government web sites using the SpidersRUs toolkit. The detailed 
file type breakdown of the US government web collection is summarized in 
Table 8.5. The file-type distribution of the government collection is similar to the 
terrorist/extremist collection. Indexable files (221,684 files) are the largest category, 
the majority of which are dynamic files (145,590 files). However, in the government 
collection, we did not find as many forum postings as in the terrorist/extremist col- 
lection. Many dynamic files in the government collection are articles dynamically 
retrieved from large-document databases at users’ requests. Multimedia files also 
have a significant presence in the government collection, indicating heavy multime- 
dia usage in government web sites. 
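
The proportions discussed for the two collections can be rechecked from the figures in Tables 8.4 and 8.5. The short sketch below only reproduces that arithmetic; the dictionary layout and key names are illustrative choices, not part of the study’s tooling.

```python
# File counts and volumes (bytes) copied from Tables 8.4 and 8.5.
COLLECTIONS = {
    "terrorist/extremist": {
        "files": 222_687, "indexable": 179_223, "dynamic": 130_972,
        "bytes": 12_362_050_865, "multimedia_bytes": 5_915_442_276,
    },
    "US government": {
        "files": 277_274, "indexable": 221_684, "dynamic": 145_590,
        "bytes": 19_341_345_384, "multimedia_bytes": 10_835_029_216,
    },
}


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


summary = {}
for name, c in COLLECTIONS.items():
    summary[name] = {
        "dynamic_share_of_indexable": percent(c["dynamic"], c["indexable"]),
        "multimedia_share_of_volume": percent(c["multimedia_bytes"], c["bytes"]),
    }
    print(name, {k: round(v, 1) for k, v in summary[name].items()})
```

With these figures, dynamic files make up roughly 73% of the indexable files in the terrorist/extremist collection and roughly 66% in the government collection, while multimedia files account for roughly 48% and 56% of the respective volumes.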

Table 8.5 US government web collection file types

US government collection          # Files    Volume (bytes)
------------------------------    -------    --------------
Grand total                       277,274    19,341,345,384
Indexable files total             221,684     6,502,288,302
  HTML files                       71,518     2,632,912,620
  Word files                          298       210,906,045
  PDF files                           841       663,293,376
  Dynamic files                   145,590     2,071,734,849
  Text files                        2,878       555,403,447
  Excel files                           4            98,560
  PowerPoint files                      5           725,017
  XML files                           554       367,214,389
Multimedia files total             49,582    10,835,029,216
  Image files                      45,707       850,011,712
  Audio files                       3,429     8,153,419,931
  Video files                         449     1,831,597,573
Archive files                         538       286,312,990
Nonstandard files                   5,471     1,717,714,876



4.2 Collection Analysis and Benchmark Comparison 

Following the DWAS approach, the presence of technical sophistication and content
richness attributes was automatically extracted from the collections using programs.
The presence of web interactivity attributes was extracted from each web site by
language experts based on the coding scheme in DWAS. Because of time limitations,
language experts examined only the top two levels of web pages in each web
site. For each web site in the two collections, three scores (technical sophistication, 
content richness, and web interactivity) were calculated based on the presence of 
the attributes and their corresponding weights in DWAS. Statistical analysis was 
conducted to compare the advancement/effectiveness scores achieved by the terror- 
ist/extremist collection and the US government collection. 
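
A comparison of this kind can be sketched with a two-sample t statistic computed over per-site scores. The text does not state which t-test variant was used, so Welch’s unequal-variance form is an assumption here, and the scores are hypothetical. Turning the statistic into a p value requires the t distribution (available in a statistics package) and is left out of this sketch.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    va = variance(a) / len(a)          # squared standard error of mean(a)
    vb = variance(b) / len(b)          # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical per-site scores for the two collections.
government_scores = [2.0, 2.5, 3.0, 2.5]
terrorist_scores = [1.0, 1.5, 2.0, 1.5]
t, df = welch_t(government_scores, terrorist_scores)
print(round(t, 3), round(df, 1))
```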



4.2.1 Benchmark Comparison Results: Technical Sophistication 

The technical sophistication comparison results are shown in Table 8.6. The results
showed that:

• The US government web sites are significantly more advanced than the terrorist
web sites in terms of basic HTML techniques (p < 0.0001). Government agencies
paid a great deal of attention to the design of their web sites, and they used many
of the HTML features to organize their web contents. Terrorists/extremists, on
the other hand, did not organize the contents on their web sites very well.

• The US government web sites are significantly more advanced than the terrorist
web sites in terms of utilizing dynamic web programming languages (p = 0.0066).
Most government web sites employed web programming technologies (e.g., PHP,
ASP, JSP, etc.) to implement functionalities such as user login, online application,
online purchase, etc. Few terrorist/extremist web sites implemented such
dynamic functionalities.

• There is no significant difference between the terrorist web sites and the US
government web sites in terms of applying advanced HTML techniques at a
significance level of 0.05 (p = 0.139).

• The terrorist web sites have a significantly higher level of embedded media usage
than the US government web sites (p = 0.0027). This unique characteristic of
terrorist/extremist web sites is discussed in detail below.

• When taking all four sets of attributes into consideration, there is no significant
difference between the technical sophistication of the Middle Eastern terrorist
web sites and the US government web sites at a significance level of 0.05
(p = 0.06).

Table 8.6 Technical sophistication comparison results

                           Weighted average score
TS attributes              US           Terrorists    t-Test result
-----------------------    ---------    ----------    -------------
Basic HTML techniques      0.9130434    0.710526      p < 0.0001**
Embedded multimedia        0.565217     0.833333      p = 0.0027**
Advanced HTML              1.789855     1.771929      p = 0.139
Dynamic web programming    2.159420     1.407894      p = 0.0066**
Average                    1.356884     1.180921      p = 0.06

**Significant at the 0.05 level
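
As a quick consistency check on Table 8.6, the “Average” row appears to be the unweighted mean of the four category scores. The sketch below recomputes it from the tabulated values; this reading of the row is an inference from the numbers, not something stated in the text.

```python
from statistics import mean

# Category scores copied from Table 8.6, in table order.
us = [0.9130434, 0.565217, 1.789855, 2.159420]
terrorists = [0.710526, 0.833333, 1.771929, 1.407894]

us_mean = mean(us)
terrorist_mean = mean(terrorists)
print(round(us_mean, 5), round(terrorist_mean, 5))
```

Both means agree with the tabulated averages (1.356884 and 1.180921) to within rounding.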

The extensive use of media in terrorist/extremist groups’ web sites is of special 
interest. While the terrorist/extremist groups are not as good as the US government 
in terms of organizing their web pages into clear layouts or implementing dynamic 
web functionalities, they employed a significantly higher level of embedded multi- 
media techniques, especially images and audio/video clips, to catch the interest of 
their target audience. In the terrorist/extremist groups’ collection, 46% of the web 
sites embedded audio/video clips into their pages, while only 29% of the US gov- 
ernment web sites provided audio/video clips. 

Multimedia content is more attractive and tends to leave a stronger impression on 
people than pure textual content. For example, the militant Islamic group Hamas 
foments a violent resistance to their “enemies” by disseminating graphic posters on 
their web sites (see Fig. 8.3). Moreover, terrorists often post images, audio, or video 
clips from their leaders or martyrs to boost the spirits of their members and support- 
ers. For example, Osama bin-Laden’s portrait appears on homepages of many 
Middle Eastern terrorist/extremist web sites. Recently, posters of the Iraqi terrorist 
leader Abu Mus’ab Zarqawi, who is suspected to be responsible for the beheading 
of several Western hostages, can also be found in Middle Eastern terrorist web sites 
(see Fig. 8.4). These posters explicitly mention that Abu Mus’ab Zarqawi is a 
“beheader” and praise his brutal killing of innocents as a way to protect Iraq. 
Terrorists/extremists also post images and audio/video clips of their “martyrdom 
operations” as a way to demonstrate their resolve to fight their enemies and inspire 
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Fig. 8.3 A Hamas poster 
inviting men to join their 
military struggle. The text on 
the poster says, “Have you 
fought for the sake of God? 
You say no. Then you should 
have your mouth shot” 
(Source: http://www. 
palestine-info.com) 
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Fig. 8.4 A poster depicting terrorist leader in Iraq, Abu Mus’ab Zarqawi. The text on the poster 
says, “Emir Zarqawi, may God save him. Eagle of Iraq, volcano of jihad, and the beheader” 
(Source: http://www.islamic-f.net/vb/) 
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Table 8.7 Content richness comparison results 





Average counts per sites 






CR attributes 


US 


Terrorists 


t-Test result 


Hyperlink 


3513.254654 


3172.658483 


p <0.0001** 


Downloaded documents 


400.9674532 


151.868427 


p = 0.0103** 


Image 


582.352456 


540.0484563 


p = 0.466 


Video/audio file 


91.55434783 


50.9736828 


p <0.0001** 



**Significant level is at 0.05 



their supporters. Many movie clips of several suicide bombing attacks in Iraq were 
posted by terrorists in one of the terrorist online forums (http://wwwlb.dm.net.lb/ 
ubb/Forum4/) to show off their “triumph over the US invaders.” The “Fighting 
Islamic Group” posted a set of detailed documentations with pictures describing 
their assassination attempt of Libyan president Mu’amar Khadafi and praising the 
“heroism” of their members. 

The multimedia content posted on terrorist/extremist web sites is aimed not only at
terrorist supporters but also at their enemies. For example, the video clip of American
Nicholas Berg being beheaded was spread to the public from a Malaysian terrorist
web site. The video of the final minutes of another American hostage, Robert Jacobs,
was first posted on Middle Eastern militant group web sites. We also found that an
Iraqi terrorist/extremist group posted pictures of executed “traitors” on their web
sites, warning other Iraqi people not to cooperate with the US Forces. Materials of
this nature are usually considered too shocking to televise by most TV news
producers. Through the Internet, however, terrorists/extremists have successfully
spread these gruesome materials to as many people as possible, especially in the
West where Internet use is more common.



4.2.2 Benchmark Comparison Results: Content Richness 

The content richness comparison results are summarized in Table 8.7. The results
showed that:

• The US government web sites provided significantly more hyperlinks (p < 0.0001),
downloadable documents (p = 0.0103), and video/audio clips (p < 0.0001) than
the terrorist/extremist web sites.

• The US government web sites provided more images than the terrorist/extremist
web sites, but the difference is not significant at a significance level of 0.05.

• Overall, the terrorist/extremist web sites are not as good as the US government
web sites in terms of content richness (p < 0.0001) because the volume of content
in terrorist/extremist web sites is often smaller than that in US government web
sites.
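
The comparison above rests on a two-sample t-test of per-site counts. The chapter does not
state which variant of the test was used, so the following is only a minimal sketch that
assumes Welch's unequal-variance form. The function name and the per-site counts are
hypothetical and are not the study's data.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return (t statistic, approximate degrees of freedom) for Welch's t-test."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # Sample variances (n - 1 in the denominator).
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    n_a, n_b = len(sample_a), len(sample_b)
    se_a, se_b = var_a / n_a, var_b / n_b
    t_stat = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    dof = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t_stat, dof


# Hypothetical per-site counts of video/audio files (NOT the study's data).
us_sites = [120, 95, 80, 101, 62]
terrorist_sites = [55, 40, 61, 48, 51]

t_stat, dof = welch_t(us_sites, terrorist_sites)
print(f"t = {t_stat:.3f}, approx. degrees of freedom = {dof:.1f}")
```

The p-value would then be read from a t distribution with the computed degrees of freedom;
that step is left out here because it needs a statistics library beyond the standard one.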
The content richness comparison results do not contradict the technical
sophistication comparison results. The content richness results showed that the US
government web sites provide a larger volume of multimedia content, while the
technical sophistication results indicated that a higher percentage of
terrorist/extremist groups’ web sites provide multimedia content. The terrorist/extremist
web sites also utilize more advanced technology to deliver their multimedia content.

One possible explanation for the smaller volume of multimedia content provided
by the terrorist/extremist groups’ web sites is the lower capacity and instability of
terrorists/extremists’ web servers. Unlike the US government web sites, which are
usually hosted on dedicated web servers, many of the terrorist/extremist groups’ web
sites in our collection are hosted on web servers provided by free public hosting
services such as GeoCities. These public web servers usually restrict the size and
bandwidth of the web sites they host, which limits terrorist/extremist
groups’ ability to host multimedia information on their web sites. The instability of
the terrorist/extremist groups’ web sites also makes it difficult for them to host
multimedia information. Many web sites frequently move their web content to other web
servers because their old sites were shut down by ISPs or hacked. While textual web
pages can be quickly and easily duplicated to the new servers, multimedia documents
are more difficult to transfer and more prone to loss because of their larger size.

Nevertheless, terrorist/extremist groups still manage to host a considerable
number of downloadable documents and multimedia files on their web sites.
These media cover a wide variety of topics ranging from propaganda campaigns to
tutorials on weapons operation and guerrilla tactics. For example, the web site of
extremist cleric Sheikh Hamed Al Ali (see Fig. 8.5) hosts a list of audio clips
consisting of preaching on Salafi ideology and political issues. The Anbaar Iraqi
terrorist/extremist group’s web site (see Fig. 8.6) provides a collection of songs and
hymns praising the “Holy war” that they are conducting.



4.2.3 Benchmark Comparison Results: Web Interactivity 

Table 8.8 summarizes the web interactivity comparison results. The results showed
that:

• In terms of supporting one-to-one-level interactivity, the US government agencies
are doing significantly better than terrorist/extremist web sites by providing
their contact information (e.g., e-mail, mail address, etc.) on their sites (p = 0.024).
Because of their covert nature, terrorist/extremist groups seldom disclose their
contact information on their web sites.

• In terms of supporting community-level interactivity, terrorist/extremist web
sites are doing significantly better than government web sites by providing online
forums and chat rooms (p = 0.0025). Few government agencies provided such online
forum and chat room support on their web sites.

• Our experts did not identify transaction-level interactivity in terrorist/extremist
web sites, although such interactivity might be hidden in their sites.
Fig. 8.5 A list of audio clips from the web site of extremist cleric Sheikh Hamed Al Ali, which
consists of preaching on Salafi ideology and political issues (Source: http://www.h-alali.net)

Fig. 8.6 “Holy war” songs and hymns presented in the audio section of the Anbaar Iraqi
terrorist/extremist group’s web site (Source: http://www.anbaar.net/audio)







Table 8.8 Web interactivity comparison results

                                      Weighted average score
WI attribute                          US          Terrorist        t-Test result
One-to-one                            0.342857    0.292169         p = 0.024**
Community                             0.028571    0.168675         p = 0.0025**
Transaction                           0.3         Not presented
Average (transaction not included)    0.185714    0.230422         p = 0.056

**Significant at the 0.05 level



[Fig. 8.7 annotations: (1) Posted by Abu Acid (forum member): “Allah (God) is the greatest,
a new message from sheikh Abu Mus’ab Zarqawi, the Emir of al-Qaeda in Mesopotamia (Iraq).”
(2) The message is in “zip” format and can be downloaded from multiple external servers.]

Fig. 8.7 Discussion forums are used to share important messages from terrorist leaders among the
members of the terrorist groups and their supporters (Source: www.islamic-f.net)



• Taking both one-to-one and community-level interactivity into consideration, we
did not find a significant difference between the terrorist/extremist web sites and
the US government web sites (p = 0.056) at a significance level of 0.05.

Several previous studies implied that terrorists rely on Internet-based
communication tools such as online chat rooms and forums to facilitate their daily
communication, command and control, and even operation planning and coordination
(Zhou et al., 2005; Whine, 1999; FBIS, 1995). Our results further confirmed these
observations. The Middle Eastern terrorist/extremist groups are very active in terms
of hosting and maintaining online forums and bulletin boards. Among the largest
terrorist-supporting forums that we have been monitoring, www.shawati.com has
31,894 registered forum members and 418,196 posts; www.kuwaitchat.net has
11,531 registered members and 624,694 posts. Not all of the forum members are
terrorists or extremists. Many of them are just supporters or sympathizers. Members
of these large forums participate in daily discussions, express their support of the
terrorist groups, and reinforce each other’s beliefs in the terrorist/extremist groups’
causes. They can sometimes get messages directly from active members of
terrorist/extremist groups. For example, messages from the Iraqi terrorist leader Abu Mus’ab
Zarqawi can often be found at the online forum www.islamic-f.net (see Fig. 8.7).
These dynamic forums provide snapshots of terrorist/extremist groups’ activities,
communications, ideologies, relationships, and evolutionary developments.
5 Conclusions and Future Directions

In this chapter, we proposed a systematic procedure to collect Dark Web contents 
and a Dark Web Attribute System (DWAS) to enable quantitative analysis of
terrorists’ tactical use of the Internet. The automatic collection building and content
analysis components used in the proposed methodology allow the efficient collection
and analysis of thousands of Dark Web documents. This enables our Dark Web 
study to achieve a higher level of comprehensiveness than previous manual 
approaches. Furthermore, the DWAS is a systematic content analysis tool that, we 
believe, brings more insights into terrorist/extremist groups’ Internet usage than 
previous observation-based studies provided. 

Using the proposed collection-building procedure and framework, we built a
high-quality Middle Eastern terrorist/extremist group web collection and
benchmarked it against the US government web site collection. The results showed that
terrorist/extremist groups adopted levels of web technologies similar to those of US
government agencies. Moreover, terrorists/extremists placed a strong emphasis on
multimedia usage, and their web sites employed significantly more sophisticated
multimedia technologies than government web sites. We also found that
terrorists/extremists seem to be as effective as the US government agencies in terms of
supporting communication and interaction using web technologies. More specifically,
terrorists/extremists make heavy use of web forums to facilitate their
communication and coordination.

Our study provides insights for policy makers to better apply counterterrorism
measures on the web. Our results showed that Internet technologies, especially
forums and chat rooms, have become a major means for terrorists/extremists to
reach out to a broad audience. They have invested a significant amount of effort and
technical expertise into building their web infrastructure. Security and law
enforcement experts should pay more attention to terrorists/extremists’ online
communication. We identified very high levels of communicative activity in terrorist/extremist
forums in our collection. Some documents in our collection were not readable using
conventional applications. Some of these documents might contain hidden
information from terrorists/extremists. Monitoring and deciphering such hidden messages
could help disrupt terrorist/extremist communication and prevent terrorist attacks.
Furthermore, we believe that the proposed Dark Web research methodology could
contribute to the terrorism research domain. The richness of the Dark Web content
calls for more studies devoted to this domain to help enrich our understanding
of terrorists/extremists’ Internet usage, online propaganda campaigns, and
psychological warfare strategies.

We have several future research directions to pursue. First, we plan to
experiment with better data analysis methods and collaborate with more terrorism/extremism
domain experts to better analyze and interpret our study results. For example,
for the content richness comparisons, we would like to conduct a more detailed
study to compare the richness of terrorist/extremist web sites to government web
sites based on the percentage of each type of media in the overall content. We also
plan to conduct a cross-comparison which takes both the TS and WI attributes into
consideration to gain more insight into the correlation of these attributes. Second,
we plan to cooperate with web technology experts to further improve the DWAS by
incorporating additional attributes and adjusting the relevant weights. Third, we
plan to expand the scope of our study by conducting a comparative analysis of
terrorist/extremist groups’ web sites across different regions of the world. We also plan
to conduct a time series analysis study on the Dark Web to analyze the evolution and
diffusion of terrorist/extremist groups’ web presence. Last but not least, we
plan to explore more advanced machine-learning techniques to detect the
technology and media usage patterns in terrorist/extremist web sites to gain more insights
into terrorists/extremists’ technology usage.
Chapter 9

Authorship Analysis 



1 Introduction 

Analysis of web content is becoming increasingly important due to amplified
communication via Internet sources such as e-mail, web sites, and forums. The
anonymous nature of these channels makes them an ideal source of contact for
militant groups and terrorist organizations. Furthermore, the global nature of criminal
activity necessitates the exploration of online communication in a multilingual
context. Application of authorship analysis techniques across multilingual web media
is of critical importance for assisting in the identification and prevention of potential
criminal activity with national security implications.

Specifically, Arabic has garnered greater attention in recent years for sociopolitical
reasons and because of ties between Middle Eastern groups and terrorism; however, there
has been an absence of studies aimed at applying authorship techniques to the
language. The morphological challenges pertaining to Arabic pose several critical
problems for authorship identification, which could be partially responsible for the
lack of previous research relating to Arabic.

We modified an existing framework for the application of authorship analysis of 
online messages and applied it to Arabic and English web forum messages
associated with extremist groups. Special multilingual components were developed for
the extraction and identification of Arabic messages. These components were geared
toward addressing the unique characteristics of the language. Furthermore, a
complex message extraction component was incorporated in order to allow the use of a
more comprehensive set of features tailored specifically toward online messages.
2 Literature Review: Authorship Analysis

Authorship analysis is the process of evaluating writing characteristics in order to 
make inferences about authorship. It is rooted in the linguistic area known as
stylometry, which is defined as the statistical analysis of literary style. Two major categories
of authorship analysis are authorship identification and authorship characterization. 

Authorship identification deals with attributing authorship of unidentified writing
based on stylistic similarities between the author’s known works and the
unidentified piece. In contrast, authorship characterization attempts to formulate an author
profile by making inferences about gender, education, and cultural background
based on writing style. Authorship identification deals with classification problems, 
whereas authorship characterization is used for clustering. 

We are primarily concerned with the application of authorship identification to 
English and Arabic online messages. In order to apply authorship identification 
successfully, we had to begin by determining the relevant features and techniques to 
utilize. Unfortunately, the lack of consensus in previous authorship analysis
literature coupled with the application of identification methodologies to a new language
made this task an arduous endeavor.



2.1 Writing Style Features 

Writing style features are characteristics that can be derived from a message in order 
to facilitate authorship attribution. Numerous types of features have been used in 
previous studies including n-grams and the frequency of spelling and grammatical 
errors; however, four categories used extensively are lexical, syntactic, structural, 
and content-specific features. 

Lexical features can be broken into two categories: word-based and character-based. 
Word-based lexical features include total number of words, words per sentence, 
word-length distribution, vocabulary richness, etc. Character-based lexical features 
include total number of characters, characters per sentence, characters per word, 
and the usage frequency of individual letters. 

Syntax refers to the patterns used for the formation of sentences. This category of
features comprises the tools used to structure sentences, such as punctuation
and function words. Examples of function words are “while” and “upon.” Usage
patterns of function words can be effective features for authorship identification. 
For example, the difference between using the word “thus” or “hence” may appear 
subtle but can constitute a significant stylistic difference. 

Structural features deal with the organization and layout of the text. This set of
features has been shown to be particularly important for online messages (De Vel
et al. 2001). Previously used structural features have focused on word structure,
such as the use of greetings and signatures, or on the number of paragraphs and
average paragraph length. While these features are important discriminators, online
messages also provide additional information such as fonts, images, and links,
which these features do not capture. Although these characteristics are not writing
style features per se, they provide important insight into the authorship characteristics
of the writer. The use of various font sizes and colors requires a conscious effort, hence
making it a style marker. Similarly, the ability to embed images and icons or
provide links to different types of web sites can be a reflection of technical prowess.
Evaluation of technical characteristics measured in terms of the use of images,
hyperlinks, and audio or video media is not a novel idea; it has been applied to web
sites (Palmer and Griffith 1998). Thus, we propose a new subcategory of structural
features called technical structure which encompasses font, hyperlink, and
embedded image characteristics.

Content-specific features are words that are important within a specific topic domain. 
An example of content-specific words for a discussion on computers might be 
“RAM” and “laptop.” The rationale for content-specific words is similar to that of 
other word usage features but at a finer level of granularity. 
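
To make the four feature categories concrete, the sketch below extracts a handful of
representative features from a single message. It is an illustration only: the function
name, the short word lists, the regular expressions, and the sample message are assumptions
made for this example and are far smaller than the feature sets used in the study.
Vocabulary richness is approximated here by a simple type/token ratio.

```python
import re
from collections import Counter

# Illustrative lists only; the study's actual word lists are not reproduced here.
FUNCTION_WORDS = ("while", "upon", "thus", "hence")
CONTENT_WORDS = ("ram", "laptop")
URL_PATTERN = r"https?://\S+"


def extract_features(message):
    """Return a small dict of style features for one message."""
    text = re.sub(URL_PATTERN, " ", message)  # keep URL fragments out of word counts
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    paragraphs = [p for p in message.split("\n\n") if p.strip()]
    counts = Counter(words)
    features = {
        # Lexical (word-based and character-based)
        "total_words": len(words),
        "words_per_sentence": len(words) / max(len(sentences), 1),
        "chars_per_word": sum(map(len, words)) / max(len(words), 1),
        "vocabulary_richness": len(counts) / max(len(words), 1),
        # Structural
        "paragraphs": len(paragraphs),
        # Technical structure
        "hyperlinks": len(re.findall(URL_PATTERN, message)),
    }
    # Syntactic (function words) and content-specific word frequencies
    for word in FUNCTION_WORDS:
        features["fw_" + word] = counts[word]
    for word in CONTENT_WORDS:
        features["cw_" + word] = counts[word]
    return features


sample = ("Hello all.\n\nMy laptop needs more RAM, thus I upgraded. "
          "See http://example.com/specs for details.")
for name, value in extract_features(sample).items():
    print(name, value)
```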



2.2 Analysis Techniques 

The two most commonly used analytical techniques for authorship attribution are 
statistical and machine learning approaches. Many multivariate statistical 
approaches, such as principal component analysis, have been shown to provide a 
high level of accuracy. However, some of the pitfalls associated with statistical 
approaches include the need for more stringent models and assumptions. 

Drastic increases in computational power have enabled the emergence of machine
learning techniques such as support vector machines (SVM), neural networks, and
decision trees. These techniques have gained wider acceptance in authorship
analysis studies in recent years (Zheng et al. 2003). Machine learning approaches provide
greater scalability in terms of the number of features that can be handled and are less
susceptible to noisy data than statistical techniques. These benefits are
well suited to the identification of online messages, which entails classification
among more authors and a larger feature set.



2.3 Online Messages 

Online messages pose several problems for authorship identification as compared to
conventional forms of writing, with the biggest concern being message length.
Writing style markers are far less visible for messages shorter than a few hundred
words, with identification becoming cumbersome or even impossible in such
situations. The problem is further amplified by the larger pool of potential authors in
online attribution situations.

Additional difficulties associated with online messages center around the casual
style of online communication. E-mail and forum postings tend to be less formal
than traditional writing, resulting in more misspellings, unorthodox structure,
increased use of abbreviations, and improper use of punctuation. As a consequence,
application of authorship identification to web content is intrinsically faced with a
quagmire of noisy data.

Despite all the challenges, the unique structural characteristics of online messages 
may also provide helpful discriminators for identification. Greetings, signatures, 
quotes, links, and the use of contact information (phone, e-mail) can offer
significant insight for authorship identification. As previously elaborated, this set of
features can be further enhanced with the inclusion of technical structure features.



2.4 Multilingual Issues 

Applying authorship identification across different languages is becoming increasingly
important due to the rapid proliferation of the Internet and the threats that ensue.
Online analysis of Arabic is especially important due to the emergence of
terrorist organizations such as al-Qaeda. Nevertheless, there has been a lack of
multilingual research with the exception of a few studies done on Greek and Chinese (Peng
et al. 2003; Stamatatos et al. 2001; Zheng et al. 2003). The language dimension can
create enormous challenges for authorship identification since previously applied
features and techniques were designed for English. For example, the lack of word
segmentation in Chinese makes word-based lexical features (e.g., number of words in a
sentence) difficult to extract. Additionally, the larger volume of words in Chinese makes
vocabulary richness measures less effective. Similarly, Arabic also poses some unique
challenges with respect to the structural and stylistic properties of the language.



3 Arabic Language Characteristics 

Arabic is a Semitic language belonging to the group of Afro-Asian languages. Semitic
languages have several characteristics that can cause difficulties for authorship analysis,
including properties such as inflection, diacritics, word length, and elongation.



3.1 Inflection 

Inflection is the derivation of stem words from a root. There are approximately
5,000 roots in Arabic with each root being a three- to five-letter consonant combination
(Beesley 1996). Stems are created by adding affixes (e.g., prefixes) to the root. Over
85% of Arabic words are derived from roots, and words with common roots are
semantically related (Al-Fedaghi and Al-Anzi 1989). The orthographical and
morphological properties of Arabic result in a great deal of lexical variation since words
can take on numerous forms (Larkey and Connell 2001). Inflection creates feature
extraction problems due to the larger number of possible words, impacting
vocabulary richness measures (Zheng et al. 2003).
Fig. 9.1 Inflection example (panel labels: Root; Stem Words)



Figure 9.1 shows an inflection example demonstrating the derivation of two
words (KTAB, MKTB) from the root KTB. For the root and stems, the top row
shows the word written using English alphabet characters, and the second row
shows the word written in Arabic. Since Arabic letters are joined, making it difficult
for non-Arabic readers to decipher individual letters, the third row shows the
decomposed Arabic word in parentheses. The words KTAB (book) and MKTB (desk) are
created with the addition of the infix “A” and the prefix “M,” respectively.



3.2 Diacritics 

Diacritics are markings above or below letters used to indicate special phonetic values.
An example of diacritics in English would be the little mark found on top of the letter
“e” in the word “résumé.” These markings alter the pronunciation and meaning of the
word. Arabic uses diacritics in every word to represent short vowels, consonant
lengths, and relationships between words; however, they are rarely used in online
communication. Although readers can use the sentence semantics to decipher the proper
meaning, this is not feasible for an automated extraction program. For example, the
words “resume” and “résumé” would look identical to a computer without diacritics.
The lack of diacritics can significantly impact the effectiveness of word-usage-based
features such as function words. For example, in Arabic, it is impossible to differentiate
between the words “who” and “from” without diacritics.
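
The ambiguity can be reproduced in a few lines. The sketch below is not part of the study's
tooling; it assumes that diacritics are encoded as Unicode combining marks and uses
Python's standard unicodedata module to remove them. The function name is hypothetical.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks (accents, Arabic short-vowel marks) from text."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# English: the accented and unaccented spellings collapse to one string.
resume_accented = "r\u00e9sum\u00e9"
print(strip_diacritics(resume_accented) == "resume")

# Arabic: "who" and "from" share their letters and differ only in a short-vowel mark.
who = "\u0645\u064e\u0646"      # meem + fatha + noon
from_ = "\u0645\u0650\u0646"    # meem + kasra + noon
print(who == from_)
print(strip_diacritics(who) == strip_diacritics(from_))
```

Once the marks are gone, a frequency counter has no way of telling the two function words
apart, which is the problem described above for undiacritized online text.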



3.3 Word Length and Elongation 

The shorter length of Arabic words as compared to English words may reduce the
impact of many lexical features. Word-length features are less
effective since Arabic word-length distributions have a smaller range. While the use
of lengthier words in English can sometimes be associated with greater writing
complexity, this assumption does not hold true for Arabic. Additionally, Arabic
words are sometimes elongated for purely stylistic reasons using a special character
that resembles a dash (the tatweel, or kashida). Since Arabic characters are joined
during writing, elongation is possible by lengthening the joins between letters.
Although elongation provides an important authorship style marker, it can also create
problems, as illustrated by Table 9.1. The word MZKR (“remind”) is extended with the
addition of four dashes between the “M” and the “Z” (denoted by a faint oval), doubling
the word length.

Table 9.1 Elongation example

Elongated    English     Arabic                              Word length
No           MZKR        [Arabic script lost in extraction]  4
Yes          M----ZKR    [Arabic script lost in extraction]  8

Elongation can significantly inflate the values of word-length features. Handling 
elongation in terms of feature extraction is an important issue that must be resolved. 



4 Research Questions and Research Design 

We designed a series of experiments to test the efficacy of authorship identification 
techniques in an online setting. The objective of our experiments was to address 
numerous questions including: 

• Will authorship analysis techniques be applicable in identifying authors in Arabic? 

• What are the effects of using different types of features in identifying authors? 

• Which classification techniques are appropriate for authorship analysis? 

• How does identification performance differ between English and Arabic? 

• What are the important feature differences between the English and Arabic 
groups and language models? 



4.1 Test Bed 

Our test bed consisted of English and Arabic datasets extracted from web forum 
messages. In both instances, 20 messages were extracted for each of 20 authors, 
resulting in a total of 400 messages per language. The English dataset had an average 
message length of 76.6 words, while the Arabic average message length was 580.69 
words. The English messages were derived from a forum of a chapter of the White 
Knights of the Ku Klux Klan. The content associated with this group revolved 
around political, racial, and religious issues. Members commonly used profanities 
and advocated the use of violence against groups whom they detested. In addition to
general anger and animosity, there were also disturbing references to specific members 
of society. In some instances, complete contact information including address was 
provided for these targeted individuals. 

The Arabic dataset was extracted from forum messages associated with an
extremist group called the Al-Aqsa Martyrs. These messages had a strong
anti-American slant with members providing lengthy arguments in favor of their
viewpoint. There was an abundant use of embedded images and links relating to the war
in Iraq and treatment of al-Qaeda prisoners. The image content was extremely 
graphic in nature and intended to be used as supporting material for the authors’ 
central arguments. Much like the English messages, authors in the Arabic forum 
advocated the infliction of physical harm upon groups whom they disliked. 



4.2 Analysis Techniques 

In this chapter, we adopted two machine learning classifiers: C4.5 and SVM. C4.5
is a powerful decision tree-based classifier that has been shown to rival other
techniques in terms of performance. SVM was developed on the premise of the structural
risk minimization principle derived from computational learning theory and has
gained popularity in the past decade.

Both techniques have been applied in previous authorship analysis research 
(Zheng et al. 2003). We incorporated SVM for its classification power and robustness. 
SVM is able to handle a large number of input values with great ease due to its 
capacity for dealing with noisy data. C4.5 was used for its analytical and explanatory 
potential. Decision trees provide an effective way to assess key differences between 
the English and Arabic feature sets. 
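
The explanatory value of a decision tree comes from the way it ranks features when choosing
splits. C4.5 selects splits by gain ratio, that is, information gain divided by the entropy
of the split itself. The sketch below computes that criterion for discrete features only;
the feature names and the four toy messages are hypothetical, and the handling of
continuous attributes and pruning in a full C4.5 implementation is omitted.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy (in bits) of the value distribution in `items`."""
    total = len(items)
    return -sum((n / total) * math.log2(n / total) for n in Counter(items).values())


def gain_ratio(values, labels):
    """Gain ratio of splitting `labels` by the discrete attribute `values`."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Expected entropy of the labels after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(values)  # entropy of the partition sizes
    return info_gain / split_info if split_info > 0 else 0.0


# Toy data: four messages by two authors (hypothetical, not from the test bed).
authors = ["A", "A", "B", "B"]
uses_elongation = ["yes", "yes", "no", "no"]   # separates the authors perfectly
uses_hyperlinks = ["yes", "no", "yes", "no"]   # carries no information

print("elongation:", gain_ratio(uses_elongation, authors))
print("hyperlinks:", gain_ratio(uses_hyperlinks, authors))
```

A feature with a higher gain ratio is placed nearer the root of the tree, which is what
makes the tree readable as a ranking of discriminating features.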



4.3 Addressing Arabic Characteristics 

In order to create an effective Arabic feature set, we had to address the morphological
and orthographical properties of the language. Overcoming the diacritics problem
would require the use of a semantic tagger. Since no feasible solution existed, we
decided to focus our attention on the other challenges, namely, inflection,
elongation, and word length.



4.3.1 Inflection 

Word roots have been shown to provide superior performance to normal Arabic 
words in information retrieval. As a result of heavy inflection in Arabic, root indexing 
outperforms word indexing on both precision and recall (Hmeidi et al. 1997). 






We complemented our feature set by tracking usage frequencies of a select set of 
word roots. The use of word roots was intended to help compensate for the loss in 
effectiveness of vocabulary richness measures. Tracking root frequencies required 
matching words to their appropriate roots which could be accomplished using a 
clustering algorithm. 

De Roeck and Fares (2000) created a clustering algorithm specifically designed 
for Arabic, consisting of five steps. However, this algorithm is meant to compare 
words against other words as opposed to roots. Since we are comparing words 
against a list of roots (an easier task), not all parts of the algorithm are necessary. We 
adapted the algorithm by using three of the five steps, including blank insertion, 
cross, and Jaccard’s similarity score equation. 

Root frequencies were extracted by calculating similarity scores for each word
against a dictionary containing over 4,500 roots. Words were assigned to the root
with the highest similarity score, and the usage frequency of the selected root was
incremented. An important issue was determining the number of roots to include in
the final feature set. Because previous research offered no guidance, we used a
trial-and-error approach, as other multilingual authorship studies have done
(Stamatatos et al. 2001). In order to determine the number of roots to include, between
0 and 500 of the most frequently occurring roots were added to the complete Arabic
feature set. The classification power of these roots was tested using SVM as the
classifier. The optimal number (50 roots) was integrated into the feature set.
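
The root-matching step can be sketched as follows. This is a simplified illustration and
not the adapted algorithm itself: it assumes that words and roots are compared as sets of
character bigrams, that blank insertion means padding each token with one blank on either
side, and it omits the cross step. The three-root dictionary and the transliterated words
are hypothetical stand-ins.

```python
from collections import Counter


def bigrams(token):
    """Character bigrams of a token padded with one blank on each side."""
    padded = f" {token} "
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def assign_root(word, roots):
    """Return (best_root, score) for the root most similar to `word`."""
    word_grams = bigrams(word)
    scored = ((root, jaccard(word_grams, bigrams(root))) for root in roots)
    return max(scored, key=lambda pair: pair[1])


ROOTS = ["KTB", "DRS", "ZKR"]      # hypothetical stand-in for the root dictionary
words = ["KTAB", "MKTB", "MZKR"]   # transliterated examples used in this chapter

# Increment the usage frequency of the best-matching root for every word.
root_counts = Counter(assign_root(word, ROOTS)[0] for word in words)
print(root_counts.most_common())
```

In this sketch a word that resembles no root is still assigned to one, so a practical
version would probably need a minimum similarity threshold. Selecting the most frequent
roots for the feature set then amounts to taking the top entries of the counter.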



4.3.2 Word Length and Elongation 

Arabic words tend to be shorter than English words, with lengthier words (longer
than ten characters) less common in Arabic. However, elongation of Arabic words
can distort word-length distributions by artificially inflating them. The use of
elongation is an important authorship style marker, and hence, the occurrence and degree
(extent of stretching) of elongation should be tracked. However, a filter should be
embedded into the feature extractor to remove elongation once it has been tracked
in order to allow for precise capturing of word length.
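
A track-then-filter step of this kind could look like the sketch below. It assumes that
elongation is encoded with the Unicode tatweel character (U+0640); the function name is
hypothetical, and the Arabic spelling used for the MZKR example is an assumption made for
illustration.

```python
TATWEEL = "\u0640"  # Arabic elongation character (tatweel, also called kashida)


def elongation_features(word):
    """Return (is_elongated, elongation_count, filtered_word) for one word."""
    count = word.count(TATWEEL)
    # Track occurrence and degree first, then remove the elongation characters.
    return count > 0, count, word.replace(TATWEEL, "")


# Assumed spelling of MZKR: meem, dhal, kaf, reh.
plain = "\u0645\u0630\u0643\u0631"
# The same word with four elongation characters after the first letter.
elongated = "\u0645" + TATWEEL * 4 + "\u0630\u0643\u0631"

is_elongated, degree, filtered = elongation_features(elongated)
print(len(elongated), is_elongated, degree, len(filtered))
```

Word-length features would then be computed on the filtered word, while the occurrence
flag and the degree are kept as separate style features.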



4.4 Feature Sets 

The English feature set was adapted from previous online authorship studies (De
Vel et al. 2001; Zheng et al. 2003) and was composed of 301 features including 87
lexical, 158 syntactic, 45 structural, and 11 content-specific features. The major
difference from prior studies was that our feature set was enhanced with the inclusion
of technical structure features. The technical structure features fell into four
categories: font color, font size, and the use of embedded images and hyperlinks.

Inspection of the datasets revealed that there were 15 different font colors used
in the English messages and over 120 in the Arabic messages. A closer look showed that
many of the Arabic font colors were minor modifications of standard colors, resulting
in an inflated count. Since most of these modified colors were seldom used, we felt
that they should not be included in the feature set in order to avoid overfitting. The
consolidated color count consisted of 12 colors for English and 29 for Arabic. Other
technical structure features consisted of 8 font size, 4 embedded image, and 7
hyperlink features.

Table 9.2 Key differences between English and Arabic feature sets

Feature type    Feature                     English    Arabic
Lexical         Short word count            <=3        <=2
                Word-length distribution    1-20       1-15
                Elongation                  N/A        2
Syntactic       Function words              150        200
                Word roots                  N/A        50
Structural      Technical structure         31         48

The Arabic feature set, shown in Fig. 9.2, was modeled after the English feature 
set. It was composed of 418 features, including 79 lexical, 262 syntactic, 62 struc- 
tural, and 15 content-specific features. 

The differences between the English and Arabic feature sets are highlighted in
Table 9.2. In order to compensate for the lack of diacritics and inflection, a larger
number of function words and 50 word roots were used. A narrower word-length
distribution and a lower short word threshold were also used in the Arabic feature set.



5 Authorship Identification Procedure



The complete online authorship identification process consisted of three main steps: 
collection, extraction, and experimentation. Figure 9.3 shows the complete process 
design for Arabic authorship identification. 



5.1 Collection and Extraction 

Web forums were identified using spidering programs that crawled through the 
Internet searching for “Dark Web” material, which is content involving potentially 
dangerous or criminal activity that may be of interest for cybercrime and homeland 
security-related issues. Once the forums were recognized, collection programs 
stored the messages in text and HTML format. Extraction programs then derived
writing style characteristics identified in the feature sets from each message. The
Arabic feature extractor was a bit more complex than the English one, owing to
the need to account for elongation and inflection. An elongation filter, clustering
algorithm, and root dictionary were integrated into the Arabic extraction process.

Fig. 9.3 Authorship identification procedure for Arabic



5.2 Experiment 

Once the feature values were extracted, they were categorized into four feature sets.
The first feature set (F1) consisted of lexical features, while the second (F2) encompassed
lexical and syntactic features. Structural features were added to the first two
groups in the third feature set (F3), and content-specific features were inserted with the
other three categories in the fourth set (F4), which consequently contained all features
(lexical, syntactic, structural, and content specific). Such a stepwise increment of features
was utilized due to our perceptions concerning the order of importance of feature
categories. Studies have shown that lexical and syntactic features are the most important
categories and hence form the foundation for structural and content-specific features.
We applied this concept in our design for testing the relevance of feature categories
for online English and Arabic messages. For the experiment, we created 30 randomly
selected samples of five authors, which were used in all experiments. Each sample of
five authors was evaluated using all 20 messages per author and 30-fold cross-validation
with C4.5 and SVM. The overall accuracy was the average precision (number of correctly
identified messages/total messages) across the 30 samples. The feature type and classification
accuracies were evaluated using pairwise t tests across the samples (n = 30).
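
The evaluation arithmetic can be sketched as follows. This is an illustration rather than the code used in the study: the function names are hypothetical, and only the accuracy ratio and the paired t statistic are computed. Converting the statistic to a P value requires the t distribution with n - 1 degrees of freedom, which the Python standard library does not provide.

```python
import math
import statistics


def accuracy(predicted, actual):
    """Fraction of messages whose predicted author matches the true author."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


def paired_t_statistic(before, after):
    """Paired t statistic and degrees of freedom for matched samples.

    before/after hold one accuracy per sample (e.g. 30 samples of five
    authors), measured under two conditions such as two feature sets.
    """
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need at least two matched observations")
    diffs = [b2 - b1 for b1, b2 in zip(before, after)]
    sd = statistics.stdev(diffs)              # sample standard deviation
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1
```

With n = 30 matched samples there are 29 degrees of freedom, for which the two-tailed critical values are roughly 2.05 at alpha = 0.05 and 2.76 at alpha = 0.01.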



6 Results and Discussion 



Authorship identification accuracy results for the comparison of the different feature 
types and techniques are summarized in Table 9.3. The overall accuracies were 
exceptional, especially considering the difficult nature of the task and in comparison 
to previous authorship studies. Perhaps most surprising was the relatively small 
drop in performance across languages. In both datasets, the accuracy kept increasing 
with the addition of more feature types. The maximum accuracy was achieved with 
the use of SVM and all features for English and Arabic. 



Table 9.3 Accuracy for different feature sets across techniques

                      English dataset        Arabic dataset
Features              C4.5 (%)   SVM (%)     C4.5 (%)   SVM (%)
F1                    85.76      88.00       61.27      87.77
F1 + F2               87.23      90.77       65.40      91.00
F1 + F2 + F3          88.30      96.50       71.71      94.23
F1 + F2 + F3 + F4     90.10      97.00       71.93      94.83



Table 9.4 P values of pairwise t tests on accuracy using different feature types

Features                                 C4.5        SVM
t Test results for English dataset n = 30
F1 vs. F1 + F2                           0.000***    0.000***
F1 + F2 vs. F1 + F2 + F3                 0.000***    0.000***
F1 + F2 + F3 vs. F1 + F2 + F3 + F4       0.000***    0.1628
t Test results for Arabic dataset n = 30
F1 vs. F1 + F2                           0.000***    0.000***
F1 + F2 vs. F1 + F2 + F3                 0.000***    0.000***
F1 + F2 + F3 vs. F1 + F2 + F3 + F4       0.1216      0.0224**

**Significant with alpha = 0.05
***Significant with alpha = 0.01



Table 9.5 P values of pairwise t tests on accuracy using different classification techniques

Technique/features   F1          F1 + F2     F1 + F2 + F3   F1 + F2 + F3 + F4
t Test results for English dataset n = 30
C4.5 vs. SVM         0.000***    0.000***    0.000***       0.000***
t Test results for Arabic dataset n = 30
C4.5 vs. SVM         0.000***    0.000***    0.000***       0.000***

***Significant with alpha = 0.01



6.1 Comparison of Feature Types 

All feature categories improved classification accuracy in the stepwise analysis of 
features. Pairwise t tests were conducted to show the statistical significance of the 
additional feature types added. The results shown in Table 9.4 indicate that all 
feature types significantly improved accuracy for Arabic and English, except for 
content-specific words. This category of features was statistically insignificant in
two situations (P = 0.1628, P = 0.1216) and significant only at the less stringent
alpha level of 0.05 in a third instance (P = 0.0224). The weaker performance of
content-specific features could be attributable to their less prominent representation
in the feature set in terms of number of features. There were only 11 and 15
content-specific features used in the English and Arabic feature sets, respectively.
This number is far lower than for all
other categories of features. Overall, the impact of the different feature types for 
Arabic was consistent with English results. 



6.2 Comparison of Classification Techniques 

Table 9.5 reveals that SVM significantly outperformed the decision tree classifier in 
all cases. This is consistent with previous studies that have shown SVM to be better 
equipped to handle larger feature sets and noisier data (characteristics associated
with online authorship identification). The difference in accuracy between classifiers
was far greater for Arabic than for English: SVM outperformed C4.5 by over 20
percentage points on all Arabic feature set combinations.



7 Analysis of English and Arabic Group Models 

We evaluated the important features for the two group forums based on decision tree 
analysis and overall feature usage. The analysis highlighted some of the key differ- 
ences between the language models and revealed some interesting trends pertaining 
to the English and Arabic groups. 



7.1 Decision Tree Analysis 

The C4.5 decision tree can be used as an effective analytical tool due to its descriptive 
nature. Decision trees can be visualized to look at the effect of individual features, since 
trees choose the features with the highest discriminatory power, measured in terms of 
entropy reduction. We analyzed the C4.5 trees for the English and Arabic group models 
and extracted a list of the important features based on decision tree outputs. 
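
Entropy reduction for a single candidate split can be sketched as below. The sketch computes plain information gain for a binary threshold split on one numeric feature; C4.5 itself normalizes this quantity into a gain ratio, which is omitted here. Function names and the toy data in the usage note are illustrative only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting on `value <= threshold`."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    # Weighted entropy of the two partitions; empty partitions contribute 0.
    remainder = sum(len(part) / n * entropy(part) for part in (left, right) if part)
    return entropy(labels) - remainder
```

For example, with feature values [1, 2, 3, 4] and author labels ["a", "a", "b", "b"], a threshold of 2 separates the two authors perfectly and removes the full 1 bit of entropy, whereas a threshold of 1 removes only about 0.31 bits.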

Table 9.6 highlights the key differences between the English and Arabic models, 
based on the decision tree evaluations. For a particular group of features, the “Used” 
column indicates the number of features used, while the “Total” column refers to the 
total number of that feature type in the feature set. The “% Used” column indicates 
the percentage of that feature group incorporated by the decision tree and provides 
a good basis for comparing the KKK and Al-Aqsa Martyrs feature usage. 
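
The "% Used" figures in Table 9.6 follow directly from the "Used" and "Total" counts. The short check below recomputes them; the counts are copied from Table 9.6, while the function and variable names are illustrative.

```python
def percent_used(used, total):
    """Percentage of a feature group that appears in the decision tree."""
    return round(100.0 * used / total, 2)


# (used, total) pairs taken from Table 9.6; None marks a group that does
# not exist for that language.
TABLE_9_6 = {
    "Elongation":          {"english": None,      "arabic": (2, 2)},
    "Word length":         {"english": (8, 20),   "arabic": (3, 15)},
    "Punctuation":         {"english": (4, 8),    "arabic": (7, 12)},
    "Function words":      {"english": (31, 150), "arabic": (62, 200)},
    "Root words":          {"english": None,      "arabic": (22, 50)},
    "Word structure":      {"english": (8, 14),   "arabic": (8, 14)},
    "Technical structure": {"english": (12, 31),  "arabic": (32, 48)},
    "Content-specific":    {"english": (3, 11),   "arabic": (3, 15)},
}

PERCENT_USED = {
    group: {lang: (percent_used(*pair) if pair else None)
            for lang, pair in by_lang.items()}
    for group, by_lang in TABLE_9_6.items()
}
```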

The specific features integrated into the feature set for Arabic played an important 
role based on the decision tree analysis. Both elongation features and nearly half the 
word roots were deemed as vital attributes based on the C4.5 output, indicating that 
these are important Arabic characteristics that should be adopted in future studies. 
Furthermore, as expected, word length played a more critical role in the English 
KKK messages (40%) as compared to Arabic Al-Aqsa Martyrs messages (20%). 



Table 9.6 Summary of key features based on evaluation of decision trees

                       English                   Arabic
Features               Used   Total   % Used     Used   Total   % Used
Elongation             N/A    N/A     N/A        2      2       100
Word length            8      20      40         3      15      20
Punctuation            4      8       50         7      12      58.33
Function words         31     150     20.67      62     200     31
Root words             N/A    N/A     N/A        22     50      44
Word structure         8      14      57.14      8      14      57.14
Technical structure    12     31      38.71      32     48      66.67
Content-specific       3      11      27.27      3      15      20



Fig. 9.4 Comparison of group authorship characteristics



The importance of punctuation, function words, and word-based structural features 
was fairly consistent across both languages. This suggests that syntactical and 
structural features are fairly robust feature categories that can be applied across 
languages. The largest disparity in terms of feature importance was in the technical 
structure category. The use of font size, color, hyperlinks, and embedded images 
was more important in classifying messages from the Al-Aqsa Martyrs. The preva- 
lence in usage of technical structure features for the Arabic group did not come as a 
complete surprise; however, the amount of such features used by the decision tree 
(66.7%) was beyond our expectations. 



7.2 Feature Usage Analysis 

In order to provide a more in-depth analysis of the differences between the KKK 
and Al-Aqsa Martyrs messages, a graph consisting of writing attributes common to 
the two groups was constructed. The visualization consisted of only lexical and 
structural features, since these feature groups are mostly language independent. 
Figure 9.4 shows the average usage by language for each of these attributes.
The values were normalized to a 0-1 scale in order to facilitate more accurate
comparisons. Five major feature groups within the lexical and structural categories 
were identified, consisting of character lexical, word lexical, word length, word struc- 
ture, and technical structure. These groups were further decomposed into subgroups 
(e.g., paragraph structure) represented in either light gray or white. In addition to 
demonstrating obvious linguistic dissimilarities, the comparison revealed several 
interesting subtleties which may be attributable to group or cultural differences. 
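
The text does not state which normalization was used to reach the 0-1 scale. One plausible reading, sketched below purely as an assumption, is to divide each attribute's per-group average by the larger of the two group averages, so that the group with the higher usage maps to 1. The function name, the attribute name, and the numbers in the usage note are hypothetical.

```python
def normalize_by_max(usage):
    """Scale per-group averages of each attribute into the range [0, 1].

    usage: {attribute: {group: average_value}} with non-negative values.
    Each attribute is scaled independently by its largest group average.
    """
    normalized = {}
    for attribute, by_group in usage.items():
        peak = max(by_group.values())
        normalized[attribute] = {
            group: (value / peak if peak > 0 else 0.0)
            for group, value in by_group.items()
        }
    return normalized
```

For instance, hypothetical averages of 50 and 200 words per message would be plotted as 0.25 and 1.0, which preserves the ratio between the two groups while making attributes with very different units comparable on one axis.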



7.2.1 Word/Character Lexical 

The word- and character-level lexical features showed that the Al-Aqsa messages 
tended to be considerably longer than the KKK messages. In addition to overall 
length, sentence lengths were longer as well for the Al-Aqsa Martyrs messages. 



7.2.2 Word Length 

Based on our data, midsized words in the 6-10-letter range were far more
prevalent in Arabic than in English. However, longer Arabic words (more than ten
characters) were less common. This is consistent with previous research suggesting that
Arabic has a narrower word-length distribution than English.



7.2.3 Word Structure 

Overall, the Al-Aqsa messages had a more formal structure, with greater use of
greetings, more sentences, more paragraphs, and lengthier paragraphs. Unsurprisingly,
author contact information was not provided very often, but the KKK authors more
commonly supplied e-mail addresses and phone numbers. Typically, the addresses
and phone numbers provided belonged to groups/individuals disliked by the author.



7.2.4 Technical Structure 

Al-Aqsa messages used a plethora of font colors and sizes, often using them as tools 
to emphasize a certain point. Red, blue, and navy were used almost as much as black. 
This was in sharp contrast to the KKK messages, where black 10-12-point fonts 
were a fixture, with the exception of the occasional deviation to green or blue. 

The Al-Aqsa messages had a far higher frequency of embedded images than the 
KKK messages (approximately 20 times more). The images were either photos or 
graphics represented by JPEGs and GIFs. The majority of the disparity in the use of 
embedded images was with respect to GIF and PNG file usage. The Al-Aqsa Martyrs 
forum messages frequently used GIFs to represent slogans and logos while there 
were no signs of this in the KKK messages.

The Al-Aqsa group’s messages also had far more links to static, dynamic, and
image pages. Links to multimedia files existed in both forums; however, such direct
links were not very common. Some multimedia links were via web sites and were thus
classified as web page links by the parser.



7.2.5 Inferences 

Both forums consisted of messages that stated opinions and beliefs. However, the 
structure and dynamics of the two groups were quite different. The KKK forum 
messages were shorter and more conversational, implying greater familiarity 
between members. The Al-Aqsa group messages were more structured and formal 
and had a stronger persuasive inclination. The authors appeared to be making a 
concerted effort to state and justify their position by using a systematic and thor- 
ough writing approach. Bulleted points, paragraphs with headings, and generally 
longer message lengths, supported by embedded images and links, were the stan- 
dard structural theme for the Al-Aqsa messages. 



8 Conclusions and Future Directions 

In this research, we successfully applied authorship identification techniques for the 
classification of English and Arabic extremist group forum messages. In order to 
accomplish this task, we used techniques and features to overcome the challenges 
realized based on the linguistic properties of Arabic. All feature types incorporated 
(lexical, syntactic, structural, and content specific) showed significant discriminating 
power for Arabic and English, resulting in exceptional classification accuracy. 

With an established set of features and techniques for multilingual authorship 
analysis, we have several potential future directions. One of the limitations of cur- 
rent authorship identification methodologies is the number of authors that it can be 
applied to. In order to truly address the online anonymity problem, the techniques 
would require significant upward scalability to help discriminate between hundreds 
of potential authors. The development of more complex methodologies for differen- 
tiating between a larger set of authors is an important future endeavor. We also plan 
a more comprehensive analysis of English and Arabic extremist group authorship 
tendencies in order to distinguish group-level differences from linguistic disparities 
inherent between English and Arabic. For example, do the “persuasive” tendencies 
observed regarding the Al-Aqsa Martyrs messages have broader applicability to 
other extremist Arabic groups? Furthermore, what roles do geographic proximity
and time play in group and individual authorship characteristics? Evaluation of
these questions could prove to be an interesting venture.



Chapter 10

Sentiment Analysis 



1 Introduction 

Analysis of Web content is becoming increasingly important due to augmented 
communication via computer-mediated communication (CMC) Internet sources 
such as e-mail, Web sites, forums, and chat rooms. The numerous benefits of the 
Internet and CMC have been coupled with the realization of some vices, including 
cybercrime. In addition to misuse in the form of deception, identity theft, and the 
sales and distribution of pirated software, the Internet has also become a popular 
communication medium and haven for extremist and hate groups. This problematic 
facet of the Internet is often referred to as the Dark Web (Chen 2006). 

Stormfront, what many consider to be the first hate group Web site (Kaplan and 
Weinberg 1998), was created around 1996. Since then, researchers and hate watch 
organizations have begun to focus their attention toward studying and monitoring 
such online groups (Leets 2001). Despite the increased focus on analysis of such
groups’ Web content, there has been limited evaluation of forum postings, with the
majority of studies focusing on Web sites. Burris et al. (2000) acknowledged that 
there was a need to evaluate forum and chat room discussion content. Schafer (2002) 
also stated that it was unclear as to how much and what kind of forum activity was 
going on with respect to hateful cyber activist groups. Due to the lack of under- 
standing and current ambiguity associated with the content of such groups’ forum 
postings, analysis of extremist group forum archives is an important endeavor. 

Sentiment analysis attempts to identify and analyze opinions and emotions. 
Hearst (1992) and Wiebe (1994) originally proposed the idea of mining direction- 
based text, i.e., text containing opinions, sentiments, affects, and biases. Traditional 
forms of content analysis, such as topical analysis, may not be effective for forums. 
Nigam and Hurst (2004) found that only 3% of Usenet sentences contained topical 
information. In contrast, Web discourse is rich in sentiment-related information 
(Subasic and Huettner 2001). Consequently, in recent years, sentiment analysis
has been applied to various forms of Web-based discourse (Agrawal et al. 2003;
Efron 2004). Application to extremist group forums can provide insight into significant
discussions and trends.

In this study, we propose the application of sentiment analysis techniques to
hate/extremist groups’ forum postings. Our analysis encompasses classification of
sentiments on two forums: a US supremacist group and a Middle Eastern extremist group.
The remainder of this chapter is organized as follows. Section 2 presents a review of 
current research on sentiment classification. Section 3 describes research gaps and 
questions, while Sect. 4 presents our research design. Section 5 describes the EWGA 
algorithm and our proposed feature set. Section 6 presents experiments used to eval- 
uate the effectiveness of the proposed approach and discussion of the results. 
Section 7 concludes with closing remarks and future directions. 



2 Related Work 

Extremist groups often use the Internet to promote hatred and violence (Glaser et al. 
2002). The Internet offers a ubiquitous, quick, inexpensive, and anonymous means 
of communication for such groups (Crilley 2001). Zhou et al. (2005) did an in-depth 
analysis of US hate group Web sites and found significant evidence of fundraising, 
propaganda, and recruitment-related content. Abbasi and Chen (2005) also corrobo- 
rated signs of Web usage as a medium for propaganda by US supremacist and 
Middle Eastern extremist groups. These findings provide insight into extremist 
group Web usage tendencies; however, there has been little analysis of Web forums. 
Burris et al. (2000) acknowledged the need to evaluate forum and chat room discussion
content. Schafer (2002) also noted that it was unclear how much and what kind of
forum activity was going on with respect to extremist groups. Automated analysis
of Web forums can be an arduous endeavor due to the large volumes of noisy information
contained in CMC archives. Consequently, previous studies have predominantly
incorporated manual or semiautomated methods (Zhou et al. 2005). Manual
examination can be an extremely tedious effort when
applied across thousands of forum postings. With increasing usage of CMC, the
need for automated text classification and analysis techniques has grown in recent 
years. While numerous forms of text classification exist, we focus primarily on 
sentiment analysis for two reasons. First, Web discourse is rich in opinion and emo- 
tion-related content. Second, analysis of this type of text is highly relevant to propa- 
ganda usage on the Web since directional/opinionated text plays an important role 
in influencing people’s perceptions and decision making (Picard 1997). 



2.1 Sentiment Classification 

Sentiment analysis is concerned with analysis of direction-based text, i.e., text containing
opinions and emotions. We focus on sentiment classification studies which
attempt to determine whether a text is objective or subjective, or whether a subjective
text contains positive or negative sentiments. Sentiment classification has several
important characteristics including the various tasks, features, techniques, and application
domains. These are summarized in the taxonomy presented in Table 10.1.

Table 10.1 A taxonomy of sentiment polarity classification

Tasks
Category           Description                                                  Label
Classes            Positive/negative sentiments or objective/subjective texts  C1
Level              Document- or sentence-/phrase-level classification          C2
Source             Whether source/target of sentiment is known or extracted    C3

Features
Category           Examples                                                     Label
Syntactic          Word/POS tag n-grams, phrase patterns, punctuation          F1
Semantic           Polarity tags, appraisal groups, semantic orientation       F2
Link-based         Web links, send/reply patterns, and document citations      F3
Stylistic          Lexical and structural measures of style                    F4

Techniques
Category           Examples                                                     Label
Machine learning   Techniques such as SVM, naive Bayes, etc.                   T1
Link analysis      Citation analysis and message send/reply patterns           T2
Similarity score   Phrase pattern matching, frequency counts, etc.             T3

Domains
Category           Description                                                  Label
Reviews            Product, movie, and music reviews                           D1
Web discourse      Web forums and blogs                                        D2
News articles      Online news articles and Web pages                          D3

We are concerned with classifying sentiments in extremist group forums. Based 
on the proposed taxonomy, Table 10.2 shows selected previous studies dealing with 
sentiment classification. We discuss the taxonomy and related studies in detail 
below. 



2.2 Sentiment Analysis Tasks 

There have been several sentiment polarity classification tasks. Three important 
characteristics of the various sentiment polarity classification tasks are the classes, 
classification levels, and assumptions about sentiment source and target (topic). The 
common two-class problem involves classifying sentiments as positive or negative 
(Pang et al. 2002; Turney 2002). Additional variations include classifying messages 
as opinionated/subjective or factual/objective (Wiebe et al. 2001, 2004). A closely 
related problem is affect classification which attempts to classify emotions instead 
of sentiments. Examples of affect classes include happiness, sadness, anger, horror, 
etc. (Subasic and Huettner 2001; Grefenstette et al. 2004; Mishne 2005). 

Sentiment polarity classification can be conducted at the document, sentence, or
phrase (part of sentence) level. Document-level polarity categorization attempts to
classify sentiments in movie reviews, news articles, or Web forum postings (Wiebe
et al. 2001; Pang et al. 2002; Mullen and Collier 2004; Pang and Lee 2004; Whitelaw
et al. 2005). Sentence-level polarity categorization attempts to classify positive and
negative sentiments for each sentence (Yi et al. 2003; Mullen and Collier 2004; Pang
and Lee 2004) or whether a sentence is subjective or objective (Riloff et al. 2003).
There has also been work on phrase-level categorization in order to capture multiple
sentiments that may be present within a single sentence (Wilson et al. 2005).

Table 10.2 Selected previous studies in sentiment polarity classification

                                Features            Reduce     Techniques      Domains         Number of
Study                           F1   F2   F3   F4   features   T1   T2   T3    D1   D2   D3    languages
Subasic and Huettner 2001       ✓    ✓    -    -    No         -    -    ✓     -    -    ✓     1
Tong 2001                       ✓    ✓    -    -    No         -    -    ✓     ✓    -    -     1
Morinaga et al. 2002            ✓    -    -    -    Yes        -    -    ✓     ✓    -    -     1
Pang et al. 2002                ✓    -    -    -    No         ✓    -    -     ✓    -    -     1
Turney 2002                     ✓    -    -    -    No         -    -    ✓     ✓    -    -     1
Agrawal et al. 2003             ✓    -    ✓    -    No         ✓    ✓    -     -    ✓    -     1
Dave et al. 2003                ✓    -    -    -    No         ✓    -    ✓     ✓    -    -     1
Nasukawa and Yi 2003            ✓    -    -    -    No         -    -    ✓     ✓    -    -     1
Riloff et al. 2003              -    ✓    -    ✓    No         ✓    -    -     -    -    ✓     1
Yi et al. 2003                  ✓    ✓    -    -    Yes        -    -    ✓     ✓    -    ✓     1
Yu and Hatzivassiloglou 2003    ✓    ✓    -    -    No         ✓    -    ✓     -    -    ✓     1
Beineke et al. 2004             -    ✓    -    -    No         ✓    -    ✓     ✓    -    -     1
Efron 2004                      ✓    -    ✓    -    No         ✓    ✓    -     -    ✓    -     1
Fei et al. 2004                 -    ✓    -    -    No         -    -    ✓     ✓    -    -     1
Gamon 2004                      ✓    -    -    ✓    Yes        ✓    -    -     ✓    -    -     1
Grefenstette et al. 2004        ✓    ✓    -    -    No         -    -    ✓     -    ✓    -     1
Hu and Liu 2004                 ✓    ✓    -    -    No         -    -    ✓     ✓    -    -     1
Kanayama et al. 2004            ✓    ✓    -    -    No         -    -    ✓     ✓    -    -     1
Kim and Hovy 2004               -    ✓    -    -    No         -    -    ✓     -    ✓    -     1
Pang and Lee 2004               ✓    -    -    -    No         ✓    -    ✓     ✓    -    -     1
Mullen and Collier 2004         ✓    ✓    -    -    No         ✓    -    -     ✓    -    -     1
Nigam and Hurst 2004            ✓    ✓    -    -    No         ✓    -    -     -    ✓    -     1
Wiebe et al. 2004               ✓    -    -    ✓    Yes        ✓    -    ✓     -    -    ✓     1
Liu et al. 2005                 ✓    ✓    -    -    No         -    -    ✓     ✓    -    -     1
Mishne 2005                     ✓    ✓    -    ✓    No         ✓    -    -     -    ✓    -     1
Whitelaw et al. 2005            ✓    ✓    -    -    No         ✓    -    -     ✓    -    -     1
Wilson et al. 2005              ✓    ✓    -    -    No         ✓    -    -     -    -    ✓     1

In addition to sentiment classes and categorization levels, different assumptions 
have also been made about the sentiment sources and targets (Yi et al. 2003). In this 
study, we focus on document-level sentiment polarity categorization (i.e., distin- 
guishing positive and negative sentiment texts). However, we also review related 
sentence-level and subjectivity classification studies due to the relevance of the features
and techniques utilized and the application domains.



2.3 Sentiment Analysis Features

There are four feature categories that have been used in previous sentiment analysis 
studies. These include syntactic, semantic, link-based, and stylistic features. Along 
with semantic features, syntactic attributes are the most commonly used set of fea- 
tures for sentiment analysis. These include word n-grams (Pang et al. 2002; Gamon 
2004), part-of-speech (POS) tags (Pang et al. 2002; Yi et al. 2003; Gamon 2004), 
and punctuation. Additional syntactic features include phrase patterns, which make 
use of POS tag n-gram patterns (Nasukawa and Yi 2003; Yi et al. 2003; Fei et al. 
2004). Fei et al. (2004) noted that phrase patterns such as “n + aj” (noun followed by
positive adjective) typically represented positive sentiment orientation, while “n + dj”
(noun followed by negative adjective) often expressed negative sentiment.
Wiebe et al. (2004) used collocations, where certain parts of fixed word n-grams 
were replaced with general word tags, thereby also creating n-gram phrase patterns. 
For example, the pattern “U-adj as-prep” would be used to signify all bigrams con- 
taining a unique (once occurring) adjective followed by the preposition “as.” 
Whitelaw et al. (2005) used a set of modifier features (e.g., very, mostly, not); the 
presence of these features transformed appraisal attributes for lexicon items. 

Semantic features incorporate manual/semiautomatic or fully automatic annota- 
tion techniques to add polarity or affect intensity-related scores to words and 
phrases. Hatzivassiloglou and McKeown (1997) proposed a semantic orientation 
(SO) method later extended by Turney (2002) that uses a mutual information calcu- 
lation to automatically compute the SO score for each word/phrase. The score is 
computed by taking the mutual information between a phrase and the word “excel- 
lent” and subtracting the mutual information between the same phrase and the word 
“poor.” In addition to pointwise mutual information, the SO approach was later also 
evaluated using latent semantic analysis (Turney and Littman 2003). 
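
The subtraction described above simplifies to a single log ratio, because the phrase's own frequency and the corpus size cancel between the two mutual information terms. The sketch below computes that ratio from co-occurrence counts. The function and parameter names are illustrative, and the small smoothing constant, included only to avoid division by zero, is an assumption of this sketch.

```python
import math


def so_pmi(near_excellent, near_poor, hits_excellent, hits_poor, smoothing=0.01):
    """Semantic orientation as PMI(phrase, "excellent") - PMI(phrase, "poor").

    near_excellent / near_poor: counts of the phrase co-occurring with each
    reference word. hits_excellent / hits_poor: counts of each reference
    word alone. A positive result suggests positive orientation.
    """
    numerator = (near_excellent + smoothing) * hits_poor
    denominator = (near_poor + smoothing) * hits_excellent
    return math.log2(numerator / denominator)
```

With equally frequent reference words, a phrase that co-occurs four times as often with "excellent" as with "poor" receives a score of 2 (ignoring smoothing), and the mirror-image case receives -2.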

Manually or semiautomatically generated sentiment lexicons (e.g., Tong 2001; 
Fei et al. 2004; Wilson et al. 2005) typically use an initial set of automatically gen- 
erated terms which are manually filtered and coded with polarity and intensity 
information. The user-defined tags are incorporated to indicate whether certain 
phrases convey positive or negative sentiment. Riloff et al. (2003) used semiauto- 
matic lexicon generation tools to construct sets of strong subjectivity, weak subjec- 
tivity, and objective nouns. Their approach outperformed the use of other features, 
including bag-of-words, for classification of objective versus subjective English 
documents. Appraisal groups (Whitelaw et al. 2005) are another effective method 
for annotating semantics to words/phrases. Initial term lists are generated using 
WordNet, which are then filtered manually to construct the lexicon. Developed 
based on appraisal theory (Martin and White 2005), each expression is manually 
classified into various appraisal classes. These classes include attitude, orientation, 
graduation, and polarity of phrases. Whitelaw et al. (2005) were able to get very 
good accuracy using appraisal groups on a movie review corpus, outperforming 
several previous studies (e.g., Mullen and Collier 2004), the automated mutual-
information-based approach (Turney 2002), as well as the use of syntactic features
(Pang et al. 2002). Manually crafted lexicons have also been used for affect analysis.
Subasic and Huettner (2001) used affect lexicons along with fuzzy semantic typing 
for affect analysis of news articles and movie reviews. Abbasi and Chen (2007, 
2008) used manually constructed affect lexicons for analysis of hate and violence in 
extremist Web forums. 

Other semantic attributes include contextual features representing the semantic 
orientation of surrounding text, which have been useful for sentence-level sentiment 
classification. Riloff et al. (2003) utilized semantic features that considered the sub- 
jectivity and objectivity of text surrounding a sentence. Their attributes measured 
the level of subjective and objective clues in the sentence prior to and following the 
sentence of interest. Pang and Lee (2004) also leveraged coherence in discourse by 
considering the level of subjectivity of sentences in close proximity to the sentence 
of interest. 

Link-based features use link/citation analysis to determine sentiments for Web 
artifacts and documents. Efron (2004) found that opinion Web pages heavily link- 
ing to each other often shared similar sentiments. Agrawal et al. (2003) observed 
the exact opposite for Usenet newsgroups discussing issues such as abortion and 
gun control. They noticed that forum replies tended to be antagonistic. Due to the 
limited usage of link-based features, it is unclear how effective they may be for 
sentiment classification. Furthermore, unlike Web pages and Usenet, other forums 
may not have a clear message link structure, and some forums are serial (no 
threads). 

Stylistic attributes include lexical and structural attributes incorporated in numer- 
ous prior stylometric/authorship studies (e.g., De Vel et al. 2001; Zheng et al. 2006). 
However, lexical and structural style markers have seen limited usage in sentiment 
analysis research. Wiebe et al. (2004) used hapax legomena (unique/once occurring 
words) effectively for subjectivity and opinion discrimination. They observed a 
noticeably higher presence of unique words in subjective texts as compared to 
objective documents across a Wall Street Journal corpus and noted “Apparently, 
people are creative when they are being opinionated” (p. 286). Gamon (2004) used 
lexical features such as sentence length for sentiment classification of feedback sur- 
veys. Mishne (2005) used lexical style markers such as words per message and 
words per sentence for affect analysis of Web blogs. While it is unclear whether 
stylistic features are effective sentiment discriminators for movie/product reviews, 
style markers have been shown to be highly prevalent in Web discourse (Abbasi and 
Chen 2005; Zheng et al. 2006; Schler et al. 2006). 



2.4 Sentiment Classification Techniques 

Previously used techniques for sentiment classification can be classified into three 
categories. These include machine learning algorithms, link analysis methods, and 
score-based approaches. 
Many studies have used machine learning algorithms, with support vector machines 
(SVM) and naive Bayes (NB) being the most commonly used. SVM has been used 
extensively for movie reviews (Pang et al. 2002; Pang and Lee 2004; Whitelaw et al. 
2005) while naive Bayes has been applied to reviews and Web discourse (Pang et al. 
2002; Pang and Lee 2004; Efron 2004). In comparisons, SVM has outperformed 
other classifiers such as NB (Pang et al. 2002). While SVM has become a dominant 
technique for text classification, other algorithms such as Winnow (Nigam and Hurst 
2004) and AdaBoost (Wilson et al. 2005) have also been used in previous sentiment 
classification studies. 

Studies using link-based features and metrics for sentiment classification have 
often used link analysis. Efron (2004) used cocitation analysis for sentiment clas- 
sification of Web site opinions while Agrawal et al. (2003) used message reply link 
structures to classify sentiments in Usenet newsgroups. An obvious limitation of 
link analysis methods is that they are not effective where link structure is not clear 
or links are sparse (Efron 2004). 

Score-based methods are typically used in conjunction with semantic features. 
These techniques generally classify message sentiments based on the summed score of 
the positive and negative sentiment features a message contains. Phrase pattern matching 
(Nasukawa and Yi 2003; Yi et al. 2003; Fei et al. 2004) requires checking text for 
manually created polarized phrase tags (positive and negative). Positive phrases are 
assigned a plus one while negative phrases are assigned a minus one. Messages 
with a positive sum are assigned to the positive sentiment class, while messages 
with a negative sum are assigned to the negative sentiment class. The semantic orientation approach 
(Hatzivassiloglou and McKeown 1997; Turney 2002) uses a similar method to score 
the automatically generated polarized phrase tags. Score-based methods have also 
been used for affect analysis where the affect features present within a message/ 
document are scored based on their degree of intensity for a particular emotion class 
(Subasic and Huettner 2001). 
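
The phrase-scoring rule described above can be sketched in a few lines. This is a minimal illustration rather than the method of any cited study: the function names, the use of single tokens in place of multiword phrase tags, and the "undecided" label for a zero sum are all assumptions, since the text does not say how ties are handled.

```python
def score_message(tokens, positive_phrases, negative_phrases):
    """Return the sum of +1 per positive hit and -1 per negative hit."""
    score = 0
    for token in tokens:
        if token in positive_phrases:
            score += 1
        elif token in negative_phrases:
            score -= 1
    return score


def classify_message(tokens, positive_phrases, negative_phrases):
    """Assign a sentiment class from the sign of the summed score."""
    score = score_message(tokens, positive_phrases, negative_phrases)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "undecided"  # assumption: a zero sum is left unclassified


if __name__ == "__main__":
    # Hypothetical polarized tags, for illustration only.
    tokens = "great plot but awful acting and awful ending".split()
    print(classify_message(tokens, {"great"}, {"awful"}))
```

A manually built lexicon would supply the two tag sets; the semantic orientation approach would instead generate them automatically before applying the same kind of sum.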



2.5 Sentiment Analysis Domains 

Sentiment analysis has been applied to numerous domains including reviews, Web 
discourse, and news articles and documents. Reviews include movie, product, and 
music reviews (Morinaga et al. 2002; Pang et al. 2002; Turney 2002). Sentiment 
analysis of movie reviews is considered to be very challenging since movie review- 
ers often present lengthy plot summaries and also use complex literary devices such 
as rhetoric and sarcasm. Product reviews are also fairly complex since a single 
review can feature positive and negative sentiments about particular facets of the 
product. 

Web discourse sentiment analysis includes evaluation of Web forums, news- 
groups, and blogs. These studies assess sentiments about specific issues/topics. 
Sentiment topics include abortion, gun control, and politics (Agrawal et al. 2003; 
Efron 2004). Robinson (2005) evaluated sentiments about 9/11 in three forums in 
the United States, Brazil, and France. Wiebe et al. (2004) performed subjectivity 
classification of Usenet newsgroup postings. 

Sentiment analysis has also been applied to news articles (Yi et al. 2003; Wilson 
et al. 2005). Henley et al. (2002) analyzed newspaper articles for biases pertaining 
to violence-related reports. They found that there was a significant difference 
between the manner in which the Washington Post and the San Francisco Chronicle 
reported news stories relating to antigay attacks, with the reporting style reflecting 
newspaper sentiments. Wiebe et al. (2004) classified objective and subjective news 
articles in a Wall Street Journal corpus. 

Some general conclusions can be drawn from Table 10.2 and the literature review. 
Most studies have used syntactic and semantic features. There has also been little 
use of feature reduction/selection techniques which may improve classification 
accuracy. In addition, most previous studies have focused on English data, predomi- 
nantly in the review domain. 



3 Research Gaps and Questions 

Based on our review of previous literature and conclusions, we have identified several 
important research gaps. First, there has been limited previous sentiment analysis 
work on Web forums, and most studies have focused on sentiment classification of a 
single language. Second, there has been almost no usage of stylistic feature categories. 
Finally, little emphasis has been placed on feature reduction/selection techniques. 



3.1 Web Forums in Multiple Languages 

Most previous sentiment classification of Web discourse has focused on Usenet and 
financial forums. Applying such methods to extremist forums is important in order 
to develop a viable set of features for assessing the presence of propaganda, anger, 
and hate in these online communities. Furthermore, there has been little evaluation 
on non-English content, with the exception of Kanayama et al. (2004) performing 
sentiment classification on Japanese text. Even in that study, machine translation 
software was used to convert the text to English. Thus, multiple language features 
have not been used for sentiment classification. The globalized nature of the Internet 
necessitates more sentiment analysis across languages. 



3.2 Stylistic Features 

Previous work has focused on syntactic and semantic features. There has been little 
use of stylistic features such as word-length distributions, vocabulary richness mea- 
sures, character- and word-level lexical features, and special character frequencies. 
Gamon (2004) and Pang et al. (2002) pointed out that many important features may 
not seem intuitively obvious at first. Thus, while prior emphasis has been on adjec- 
tives, stylistic features may uncover latent patterns that can improve classification 
performance of sentiments. This may be especially true for Web forum discourse, 
which is rich in stylistic variation (Abbasi and Chen 2005; Zheng et al. 2006). Stylistic 
features have also been shown to be highly prevalent in other forms of computer- 
mediated communication, including Web blogs (Herring and Paolillo 2006). 



3.3 Feature Reduction for Sentiment Classification 

Different automated and manual approaches have been used to craft sentiment clas- 
sification feature sets. Little emphasis has been given to feature subset selection 
techniques. Gamon (2004) and Yi et al. (2003) used log likelihood to select impor- 
tant attributes from a large initial feature space. Wiebe et al. (2004) evaluated the 
effectiveness of various potential subjective elements (PSEs) for subjectivity clas- 
sification based on their occurrence distribution across classes. However, many 
powerful techniques have not been explored. Feature reduction/selection techniques 
have two important benefits (Li et al. 2006). They can potentially improve classifica- 
tion accuracy and also provide greater insight into important class attributes, result- 
ing in a better understanding of sentiment arguments and characteristics (Guyon and 
Elisseeff 2003). Using feature reduction, Gamon (2004) was able to improve accu- 
racy and narrow in on a key feature subset of sentiment discriminators. 



3.4 Research Questions 

We propose the following research questions. 

1. Can sentiment analysis be applied to Web forums in multiple languages? 

2. Can stylistic features provide further sentiment insight and classification 
power? 

3. How can feature selection improve classification accuracy and identify key senti- 
ment attributes? 



4 Research Design 

In order to address these questions, we propose the use of a sentiment classification 
feature set consisting of syntactic and stylistic features. Furthermore, utilization of 
feature selection techniques such as genetic algorithms (Holland 1975) and informa- 
tion gain (Shannon 1948; Quinlan 1986) is also included to improve classification 
accuracy and gain insight into the important features for each sentiment class. Based 
on the prevalence of stylistic variation in Web discourse, we believe that lexical and 
Table 10.3 Text classification studies using GA, IG, and SVM weights 

Technique     Task                    Study 
GA            Stylometric analysis    Li et al. 2006 
IG            Topic classification    Efron et al. 2003 
              Stylometric analysis    Juola and Baayen 2005 
                                      Koppel and Schler 2003 
                                      Abbasi and Chen 2006 
SVM weights   Topic classification    Mladenic et al. 2004 
              Gender categorization   Koppel et al. 2002 

structural style markers can improve the ability to classify Web forum sentiments. 
Integrated stylistic features include attributes such as word-length distributions, 
vocabulary richness measures, letter usage frequencies, use of greetings, presence of 
requoted content, use of URLs, etc. 

We also propose the use of an entropy weighted genetic algorithm (EWGA) that 
incorporates the information gain (IG) heuristic with a genetic algorithm (GA) to 
improve feature selection performance. GA is an evolutionary computing search 
method (Holland 1975) that has been used in numerous feature selection applica- 
tions (Siedlecki and Sklansky 1989; Yang and Honavar 1998; Li et al. 2006; 2007). 
Oliveira et al. (2002) successfully applied GA to feature selection for handwritten 
digit recognition. Vafaie and Imam (1994) showed that GA outperformed other heu- 
ristics such as greedy search for image recognition feature selection. Like most 
random search feature selection methods (Dash and Liu 1997), it uses a wrapper 
model where the performance accuracy is used as the evaluation criterion to improve 
the feature subset in future generations. 

In contrast, IG is a heuristic based on information theory (Shannon 1948). It uses 
a filter model for ranking features which makes it computationally more efficient 
than GA. IG has outperformed numerous feature selection techniques in head-to- 
head comparisons (Forman 2003). Since our experiments will use the SVM classi- 
fier, we also plan to compare the proposed EWGA technique against the use of 
SVM weights for feature selection. In this method, the SVM weights are used to 
iteratively reduce the feature space, thereby improving performance (Koppel et al. 
2002). SVM weights have been shown to be effective for text categorization 
(Koppel et al. 2002; Mladenic et al. 2004) and gene selection for cancer classifica- 
tion (Guyon et al. 2002). GA, IG, and SVM weights have been used in several 
previous text classification studies as shown in Table 10.3. A review of feature 
selection for text classification can be found in Sebastiani (2002). 

A consequence of using an optimal search method such as GA in a wrapper 
model is that convergence toward an ideal solution can be slow when dealing with 
very large solution spaces. However, as previous researchers have argued, feature 
selection is considered an “offline” task that does not need to be repeated constantly 
(Jain and Zongker 1997). This is why wrapper-based techniques using genetic algo- 
rithms have been used for gene selection with feature spaces consisting of tens of 
thousands of genes (Li et al. 2007). Furthermore, hybrid GAs have previously been 
used for product design optimization (Alexouda and Papparrizos 2001; Balakrishnan 
et al. 2004) and scheduling problems (Levine 1996) to facilitate improved accuracy 
and convergence efficiency (Balakrishnan et al. 2004). We developed the EWGA 
hybrid GA that utilizes the information gain (IG) heuristic with the intention of 
improving feature selection quality. More algorithmic details are provided in the 
next section. 



5 System Design 



We propose the following system design (shown in Fig. 10.1). Our design has two 
major steps: extracting an initial set of features and performing feature selection. 
These steps are used to carry out sentiment classification of forum messages. 



5.1 Feature Extraction 

We incorporated syntactic and stylistic features in our sentiment classification attri- 
bute set. These features are more generic and applicable across languages. For 
instance, syntactic, lexical, and structural features have been successfully used in 




Fig. 10.1 Sentiment classification system design 
stylometric analysis studies applied to English, Chinese (Peng et al. 2003; Zheng 
et al. 2006), Greek (Stamatatos et al. 2003), and Arabic (Abbasi and Chen 2005; 
2006). Link-based features were not included since our messages were not in 
sequential order (insufficient cross-message references). These types of features are 
only effective where the test bed consists of entire threads of messages and mes- 
sage-referencing information is available. Semantic features were not used since 
these attributes are heavily context-dependent (Pang et al. 2002). Such features are 
topic and language specific. For example, the set of positive polarity words describ- 
ing a good movie may not be applicable to discussions about racism. Unlike stylis- 
tic and syntactic features, semantic features such as manually crafted lexicons 
incorporate an inherent feature selection element via the human involvement. Such 
human involvement makes semantic features (e.g., lexicons and dictionaries) very 
powerful for sentiment analysis. Lexicon developers will only include features that 
are considered to be important and weight these features based on their significance, 
thereby reducing the need for feature selection. For example, Whitelaw et al. (2005) 
used WordNet to construct an initial set of features, which were manually filtered 
and weighted to create the lexicon. Unfortunately, the language specificity of seman- 
tic features is particularly problematic for application to the Dark Web, which con- 
tains text in dozens of languages (Chen 2006). We hope to overcome the lack of 
semantic features by incorporating feature selection methods intended to isolate the 
important subset of stylistic and syntactic features and remove noise. 



5.1.1 Determining Size of Initial Feature Set 

Our initial feature set consisted of 14 different feature categories which included 
POS tag n-grams (for English), word roots (for Arabic), word n-grams, and punctua- 
tion for syntactic features. Style markers included word- and character-level lexical 
features, word-length distributions, special characters, letters, character n-grams, 
structural features, vocabulary richness measures, digit n-grams, and function words. 
The word-length distribution includes the frequency of 1-20 letter words. Word- 
level lexical features include total words per document, average word length, average 
number of words per sentence, average number of words per paragraph, total number 
of short words (i.e., ones less than four letters), etc. Character-level lexical features 
include total characters per document, average number of characters per sentence, 
average number of characters per paragraph, percentage of all characters that are in 
words, and the percentage of alphabetic, digit, and space characters. Vocabulary 
richness features include the total number of unique words used, hapax legomena 
(number of once occurring words), dis legomena (number of twice occurring words), 
and various previously defined statistical measures of richness such as Yule’s K, 
Honore’s R, Sichel’s S, Simpson’s D, and Brunet’s W measures. The structural 
features encompass the total number of lines, sentences, and paragraphs, as well as 
whether the document has a greeting or a signature. Additional structural attributes 
include whether there is a separation between paragraphs, whether the paragraphs 
Table 10.4 English and Arabic feature sets 

Category    Feature group     English   Arabic   Examples 
Syntactic   POS n-grams       Varies    -        Frequency of part-of-speech tags (e.g., NP_VB) 
            Word roots        -         Varies   Frequency of roots (e.g., slm, ktb) 
            Word n-grams      Varies    Varies   Word n-grams (e.g., senior editor, editor in chief) 
            Punctuation       8         12       Occurrence of punctuation marks (e.g., !, ;, :, and ?) 
Stylistic   Letter n-grams    26        36       Frequency of letters (e.g., a, b, c, etc.) 
            Char. n-grams     Varies    Varies   Character n-grams (e.g., abo, out, ut, ab, etc.) 
            Word lexical      8         8        Total words, % char. per word 
            Char. lexical     8         8        Total char., % char. per message 
            Word length       20        20       Frequency distribution of 1-20 letter words 
            Vocab. richness   8         8        Richness (e.g., hapax legomena and Yule’s K) 
            Special char.     20        21       Occurrence of special char. (e.g., @, #, $, %, ^, &, *, and +) 
            Digit n-grams     Varies    Varies   Frequency of digits (e.g., 100, 17, and 5) 
            Structural        14        14       Has greeting, has URL, requoted content, etc. 
            Function words    250       200      Frequency of function words (e.g., of, for, and to) 


are indented, the presence and position of quoted and forwarded content, and whether 
the document includes e-mail, URL, and telephone contact information. Further 
descriptions of the lexical vocabulary richness and structural attributes can be found 
in De Vel et al. (2001), Zheng et al. (2006), and Abbasi and Chen (2005). The Arabic 
function words were Arabic words translated from the English function word list, as 
done in previous research (e.g., Chen and Gey 2002). Only words were considered; 
for convenience, no affixes were included. 
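
A few of the word-level style markers listed above can be computed as follows. This is a rough sketch under simplifying assumptions: whitespace tokenization, ASCII punctuation stripping, and lowercasing are choices made here for illustration, and the function name `style_markers` is hypothetical. The richness statistics (Yule's K and the others) are omitted.

```python
import string
from collections import Counter


def style_markers(text, max_len=20):
    """Compute a handful of lexical and vocabulary richness markers."""
    # Assumption: split on whitespace and strip ASCII punctuation only.
    words = [w.strip(string.punctuation).lower() for w in text.split()]
    words = [w for w in words if w]
    counts = Counter(words)

    # Frequency of 1..max_len letter words (index 0 holds 1-letter words).
    length_dist = [0] * max_len
    for w in words:
        if 1 <= len(w) <= max_len:
            length_dist[len(w) - 1] += 1

    return {
        "total_words": len(words),
        "unique_words": len(counts),
        "hapax_legomena": sum(1 for c in counts.values() if c == 1),
        "dis_legomena": sum(1 for c in counts.values() if c == 2),
        "short_words": sum(1 for w in words if len(w) < 4),
        "word_length_dist": length_dist,
    }


if __name__ == "__main__":
    print(style_markers("The cat saw the dog, and the cat ran."))
```

Arabic text would need Unicode-aware punctuation handling, which this sketch does not attempt.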

Many feature categories are predefined in terms of the number of potential fea- 
tures. For example, there are only a certain number of possible punctuation and 
stylistic lexical features (e.g., words per sentence, words per paragraph, etc.). In 
contrast, there are countless potential n-gram-based features. Consequently, some 
shallow selection criterion is typically incorporated to reduce the feature space for 
n-grams. A common approach is to select features with a minimum usage frequency 
(Mitra et al. 1997; Jiang et al. 2004). We used a minimum frequency threshold of 10 
for n-gram-based features. Less common features are sparse and likely to cause 
overfitting. In addition, we only used unigrams, bigrams, and trigrams, since 
higher-order n-grams tend to be redundant. Using only up to trigrams has been shown 
to be effective for stylometric analysis (Kjell et al. 1994) and sentiment classification 
(Pang et al. 2002; Wiebe et al. 2004). Based on this criterion for n-gram features, 
Table 10.4 shows the English and Arabic feature sets. 
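
The shallow selection criterion for n-grams can be illustrated as follows. The sketch assumes that "frequency" means total occurrences across the corpus (the text does not say whether document frequency was used instead) and that documents are already tokenized; the function names are hypothetical.

```python
from collections import Counter


def extract_ngrams(tokens, max_n=3):
    """Yield all unigrams, bigrams, and trigrams of a token list as tuples."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def select_ngram_features(documents, min_freq=10, max_n=3):
    """Keep n-grams whose corpus-wide count reaches the threshold."""
    counts = Counter()
    for tokens in documents:
        counts.update(extract_ngrams(tokens, max_n))
    return {gram for gram, count in counts.items() if count >= min_freq}


if __name__ == "__main__":
    docs = [["a", "b", "c"]] * 10 + [["a", "d"]]
    print(sorted(select_ngram_features(docs)))
```

The same routine applies to word, character, POS tag, or digit sequences, depending on what the token lists contain.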



5.1.2 Feature Extraction Component 

Due to the challenging morphological characteristics of Arabic, our attribute extrac- 
tion process features a component for tracking elongation as well as a root extrac- 
tion algorithm (illustrated in Fig. 10.2). 
Fig. 10.2 Arabic extraction component 



Elongation is the process of using a dash-like “kashida” character for stylistic 
word stretching (shown in step 1 in Fig. 10.2). The use of elongation is very prevalent 
in Arabic Web forum discourse (Abbasi and Chen 2005). In addition to tracking the 
presence and extent of elongation, we filter out these “kashida” characters in order to 
ensure reliable extraction of the remaining features (step 2 in Fig. 10.2). The filtered 
words are then passed through a root extraction algorithm (Abbasi and Chen 2005) 
that compares each word against a root dictionary to determine the appropriate word- 
root match (step 3). Root frequencies are tracked in order to account for the highly 
inflective nature of Arabic which reduces the effectiveness of standard bag-of-words 
features. The remaining stylistic and syntactic features are then extracted in a similar 
manner for English and Arabic (step 4). 
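
Steps 1 and 2 of the extraction component (tracking and filtering elongation) can be sketched as below. The kashida is taken to be the Unicode character U+0640 (ARABIC TATWEEL); the function name is hypothetical, and the dictionary-based root matching of step 3 is not shown.

```python
KASHIDA = "\u0640"  # ARABIC TATWEEL, the dash-like elongation character


def track_and_strip_elongation(word):
    """Return the word without kashida characters and the number removed."""
    count = word.count(KASHIDA)
    return word.replace(KASHIDA, ""), count


if __name__ == "__main__":
    # "salam" written with three elongation characters (illustrative input).
    stretched = "\u0633\u0644\u0640\u0640\u0640\u0627\u0645"
    filtered, extent = track_and_strip_elongation(stretched)
    print(filtered, extent)
```

The filtered word is what would be passed on to root extraction, while the count is retained as a stylistic feature.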



5.2 Feature Selection: Entropy Weighted Genetic Algorithm (EWGA) 

Most previous hybrid GA variations combine GA with other search heuristics such as 
beam search, where the beam search output is used as part of the initial GA population 
(Alexouda and Papparrizos 2001; Balakrishnan et al. 2004). Additional hybridiza- 
tions include modification of the GA’s crossover (Aggarwal et al. 1997) and mutation 
operators (Balakrishnan et al. 2004). The entropy weighted genetic algorithm (EWGA) 
Fig. 10.3 EWGA illustration 




uses the information gain (IG) heuristic to weight the various sentiment attributes. 
These weights are then incorporated into the GA’s initial population and crossover 
and mutation operators. The major steps for the EWGA are as follows: 



EWGA Steps 

1. Derive feature weights using IG. 

2. Include IG selected features as part of initial GA solution population. 

3. Evaluate and select solutions based on fitness function. 

4. Crossover solution pairs at point that maximizes total IG difference 
between the two solutions. 

5. Mutate solutions based on feature IG weights. 

6. Repeat steps 3-5 until stopping criterion is satisfied. 



Figure 10.3 shows an illustration of the EWGA process. A detailed description of 
the IG, initial population, evaluation and selection, crossover, and mutation steps is 
presented in Fig. 10.3. 
5.2.1 Information Gain 

For information gain (IG), we used the Shannon entropy measure (Shannon 1948) 
in which: 



$$IG(C, A) = H(C) - H(C \mid A)$$

where: 

$IG(C, A)$ is the information gain for feature A; 

$H(C) = -\sum_{i=1}^{n} p(C = i) \log_2 p(C = i)$ is the entropy across sentiment classes C; 

$H(C \mid A) = -\sum_{i=1}^{n} p(C = i \mid A) \log_2 p(C = i \mid A)$ is the specific feature conditional entropy; 

$n$ is the total number of sentiment classes. 

If the number of positive and negative sentiment messages is equal, $H(C)$ is 1. 
Furthermore, the information gain for each attribute A will vary along the range 0-1 
with higher values indicating greater information gain. All features with an 
information gain greater than 0.0025 (i.e., $IG(C, A) > 0.0025$) are selected. The use of 
such a threshold is consistent with prior work using IG for text feature selection 
(Yang and Pedersen 1997). 
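
The information gain computation and threshold can be sketched as follows. The sketch assumes each feature has been discretized (for example, present or absent) and that the conditional entropy is averaged over the feature's values weighted by their probability, which is the usual form but is not spelled out in the text. The function names are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """IG(C, A) = H(C) - H(C | A) for one discretized feature A."""
    total = len(labels)
    conditional = 0.0
    for value in set(feature_values):
        subset = [lab for f, lab in zip(feature_values, labels) if f == value]
        conditional += (len(subset) / total) * entropy(subset)
    return entropy(labels) - conditional


def select_features(columns, labels, threshold=0.0025):
    """Return indices of feature columns whose IG exceeds the threshold."""
    return [i for i, col in enumerate(columns)
            if information_gain(col, labels) > threshold]


if __name__ == "__main__":
    labels = ["pos", "pos", "neg", "neg"]
    print(information_gain([1, 1, 0, 0], labels))  # perfectly informative
    print(information_gain([1, 0, 1, 0], labels))  # uninformative
```

With balanced classes the class entropy is 1 bit, so a feature that separates the classes perfectly reaches the upper bound of the 0-1 range.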



5.2.2 Solution Structure and Initial Population 

We represent each solution in the population using a binary string of length equal to 
the total number of features, with each binary string character representing a single 
feature. Specifically, 1 represents a selected feature while 0 represents a discarded 
one. For example, a solution string representing five candidate features, “10011,” 
means that the first, fourth, and fifth features are selected, while the other two are 
discarded (Li et al. 2006). In the standard GA, the initial population of n strings is 
randomly generated. In the EWGA, n-1 solution strings are randomly generated 
while the IG solution features are used as the final solution string in the initial 
population. 
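
A minimal sketch of the seeded initial population is given below. It assumes the IG solution string marks exactly those features whose information gain exceeds the selection threshold; the function name and the `rng` parameter are illustrative.

```python
import random


def initial_population(n, ig_weights, threshold=0.0025, rng=None):
    """Build n binary solution strings: n-1 random plus one IG seed."""
    rng = rng or random.Random()
    m = len(ig_weights)
    # n-1 randomly generated solutions, one bit per candidate feature.
    population = [[rng.randint(0, 1) for _ in range(m)] for _ in range(n - 1)]
    # The final solution selects the features chosen by information gain.
    ig_seed = [1 if w > threshold else 0 for w in ig_weights]
    population.append(ig_seed)
    return population


if __name__ == "__main__":
    for solution in initial_population(5, [0.2, 0.0, 0.001, 0.01]):
        print("".join(map(str, solution)))
```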



5.2.3 Evaluation and Selection 

We use classification accuracy as the fitness function to evaluate the quality 
of each solution. Hence, for each genome in the population, tenfold cross-validation 
with SVM is used to assess the fitness of that particular solution. Solutions for the 
next iteration are selected probabilistically with better solutions having a higher prob- 
ability of selection. While several population replacement strategies exist, we use the 
generational replacement method originally defined by Holland (1975) in which the 
entire population is replaced every generation. Other replacement alternatives include 
steady-state methods where only a fraction of the population is replaced every 
iteration, while the majority is passed over to the next generation (Levine 1996). 
Generational replacement is used in order to maintain solution diversity and prevent 
premature convergence attributable to the IG seed solution dominating the other 
solutions (Bentley 1990; Aggarwal et al. 1997; Balakrishnan et al. 2004). 
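
Probabilistic selection with generational replacement might look like the sketch below. Fitness-proportionate (roulette wheel) sampling is an assumption: the text states only that better solutions have a higher probability of selection. The fitness values would come from tenfold cross-validation accuracy, which is not computed here.

```python
import random


def select_next_generation(population, fitnesses, rng=None):
    """Sample a full replacement population, weighted by fitness."""
    rng = rng or random.Random()
    chosen = rng.choices(population, weights=fitnesses, k=len(population))
    # Copy each solution so later crossover/mutation cannot alias parents.
    return [list(solution) for solution in chosen]


if __name__ == "__main__":
    population = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]
    accuracies = [0.71, 0.88, 0.80]  # hypothetical fitness values
    print(select_next_generation(population, accuracies, random.Random(0)))
```

Because the whole population is resampled each generation, the IG seed solution survives only as long as its fitness keeps it competitive.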



5.2.4 Crossover 

From the n solution strings in the population (i.e., n/2 pairs), certain adjacent string 
pairs are randomly selected for crossover based on a crossover probability $P_c$. In 
the standard GA, we use single-point crossover by selecting a pair of strings and 
swapping substrings at a randomly determined crossover point x. 



S = 010010                 S = 010 | 010                 S = 010100 
             → x = 3 →                         → 
T = 110100                 T = 110 | 100                 T = 110010 



The IG heuristic is utilized in the EWGA crossover procedure in order to improve 
the quality of the newly generated solutions. Given a pair of solution strings S and 
T , the EWGA crossover method selects a crossover point x that maximizes the dif- 
ference in cumulative information gain across strings S and T. Such an approach is 
intended to create a more diverse solution population: those with heavier concentra- 
tions of features with higher IG values and those with fewer IG features. The cross- 
over point selection procedure can be formulated as follows: 



$$\arg\max_{x} \left| \sum_{A=1}^{x} IG(C, A)(S_A - T_A) + \sum_{A=x}^{m} IG(C, A)(T_A - S_A) \right|$$

where $IG(C, A)$ is the information gain for feature A; $S_A$ is the Ath character in 
solution string S; $T_A$ is the Ath character in solution string T; $m$ is the total 
number of features; and $x$ is the crossover point in solution pair S and T, where $1 < x < m$. 

Maximizing the IG differential between solution pairs in the crossover process 
allows the creation of potentially better solutions. Solutions with higher IG contain 
attributes that may have greater discriminatory potential while the lower IG solu- 
tions help maintain the diversity balance in the solution population. Such balance is 
important to avoid premature convergence of solution populations toward local 
maxima (Aggarwal et al. 1997). 
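
The IG-guided choice of crossover point can be sketched as below. Two conventions are assumptions made for this illustration: the crossover point is a zero-based split index (the first x bits stay, the rest are swapped), and the difference in cumulative information gain between the two offspring is taken in absolute value. The function names are hypothetical.

```python
def cumulative_ig(solution, ig_weights):
    """Sum the information gain of the features selected in a solution."""
    return sum(w for bit, w in zip(solution, ig_weights) if bit)


def single_point_crossover(s, t, x):
    """Swap the substrings of s and t that follow split index x."""
    return s[:x] + t[x:], t[:x] + s[x:]


def ewga_crossover(s, t, ig_weights):
    """Cross s and t at the point maximizing the offspring IG difference."""
    def ig_difference(x):
        new_s, new_t = single_point_crossover(s, t, x)
        return abs(cumulative_ig(new_s, ig_weights)
                   - cumulative_ig(new_t, ig_weights))

    best_x = max(range(1, len(s)), key=ig_difference)
    new_s, new_t = single_point_crossover(s, t, best_x)
    return new_s, new_t, best_x


if __name__ == "__main__":
    s, t = [1, 1, 0, 0], [0, 0, 1, 1]
    print(ewga_crossover(s, t, [0.4, 0.3, 0.2, 0.1]))
```

With x = 3, `single_point_crossover` reproduces the standard GA example given earlier in this section (S = 010010 and T = 110100 become 010100 and 110010).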
5.2.5 Mutation 

The traditional GA mutation operator randomly mutates individual feature characters 
in a solution string based on a mutation probability constant $P_m$. The EWGA 
mutation operator factors the attribute information gain into the mutation probabil- 
ity as shown below. This is done in order to improve the likelihood of inclusion into 
the solution string for features with higher information gain while decreasing the 
probability of features with lower information gain. Our mutation operator sets the 
probability of a bit to mutate from 0 to 1 based on the feature’s information gain, 
whereas the probability to mutate from 1 to 0 is set to the value one minus the fea- 
ture’s information gain. Balakrishnan et al. (2004) demonstrated the potential for 
modified mutation operators that favored features with higher weights in their hybrid 
genetic algorithm geared toward product design optimization. 

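$$P_m(A) = \begin{cases} B\,[IG(C, A)], & \text{if } S_A = 0 \\ B\,[1 - IG(C, A)], & \text{if } S_A = 1 \end{cases}$$

where $P_m(A)$ is the probability of mutation for feature A; $IG(C, A)$ is the information 
gain for feature A; $S_A$ is the Ath character in solution string S; and $B$ is a constant 
in the range 0-1. 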



5.3 Classification 

Because our research focus is on sentiment feature extraction and selection, in all 
experiments, SVM is used with tenfold cross-validation and bootstrapping to clas- 
sify sentiments. We chose SVM in our experiments because it has outperformed 
other machine learning algorithms for various text classification tasks (Pang et al. 
2002; Abbasi and Chen 2005; Zheng et al. 2006). We use a linear kernel with the 
sequential minimal optimization (SMO) algorithm (Platt 1999) included in the 
Weka data mining package (Witten and Frank 2005). 



6 System Evaluation 

Experiments were conducted on English and Arabic Web forums. The overall accuracy 
was the average classification accuracy across all ten folds, where the classification 
accuracy was computed as follows: 

$$\text{Classification Accuracy} = \frac{\text{Number of Correctly Classified Documents}}{\text{Total Number of Documents}}$$

In addition to tenfold cross-validation, bootstrapping was used to randomly 
select 50 samples for statistical testing, as done in previous research (e.g., Whitelaw 
et al. 2005). For each sample, we used 5% of the instances for testing and the other 
95% for training. Pairwise t tests were performed on the bootstrap values to assess 
statistical significance. 

We conducted two experiments to evaluate the effectiveness of our features as well 
as feature selection methods for sentiment classification of messages from English 
and Arabic extremist Web forums. SVM was run using tenfold cross-validation, with 
900 messages used for training and 100 for testing in each fold. Bootstrapping was 
performed by randomly selecting 50 messages for testing and the remaining 950 for 
training, 50 times. In experiment 1a, we evaluated the effectiveness of syntactic and 
stylistic features. Experiment 1b focused on evaluating the effectiveness of feature 
selection for sentiment analysis across English and Arabic forums. 
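
The evaluation protocol (tenfold cross-validation with 900/100 splits, plus 50 bootstrap samples of 950/50) can be sketched as index generators. The shuffling, seeding, and function names are assumptions; the classifier itself and the pairwise t tests are not shown.

```python
import random


def accuracy(predicted, actual):
    """Fraction of documents whose predicted class matches the true class."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def tenfold_splits(n_docs, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_docs))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, folds[i]


def bootstrap_splits(n_docs, test_size=50, rounds=50, seed=0):
    """Yield (train, test) index lists, drawing a random test set each round."""
    rng = random.Random(seed)
    for _ in range(rounds):
        test = rng.sample(range(n_docs), test_size)
        held_out = set(test)
        train = [j for j in range(n_docs) if j not in held_out]
        yield train, test


if __name__ == "__main__":
    train, test = next(tenfold_splits(1000))
    print(len(train), len(test))
```

The overall accuracy would then be the mean of `accuracy` over the ten folds, and the 50 bootstrap accuracies would feed the significance tests.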



6.1 Test Bed 

Our test bed consists of messages from two major extremist forums (one US and 
one Middle Eastern) collected as part of the Dark Web project (Chen 2006). This 
project involves spidering the Web and collecting Web sites and forums relating to 
hate and extremist groups. The initial list of group URLs is collected from authorita- 
tive sources such as government agencies and the United Nations. These URLs are 
then used to gather additional relevant forums and Web sites. 

The US forum www.nazi.org is an English forum that belongs to the Libertarian 
National Socialist Green Party (LNSG). This is an Aryan supremacist group that 
gained notoriety when a forum member was involved in a school shooting in 2004. 
The Middle Eastern forum www.la7odood.com is a major Arabic-speaking partisan 
forum discussing the war in Iraq and support for the insurgency. The forum’s con- 
tent includes numerous al-Qaeda speeches and beheading videos. 

We randomly selected 1,000 polar messages from each forum, which were manually
tagged. The polarized messages represented those in favor of (agonists) and
against (antagonists) a particular topic. The number of messages used is consistent 
with previous classification studies (Pang et al. 2002). In accordance with previous 
sentiment classification experiments, a maximum of 30 messages was used from 
any single author. This was done in order to ensure that sentiments were being clas- 
sified as opposed to authors. For the US forum, we selected messages relating to 
racial issues. Agonistic sentiment messages were considered to be those in favor of 
racial diversity. In contrast, antagonistic sentiment messages had content denounc- 
ing racial diversity, integration, interracial marriages, and race mixing. For the 
Middle Eastern forum, we selected messages relating to the insurgency in Iraq. 
Agonistic messages were considered to be those opposed to the insurgency. These 
messages had positive sentiments about the Iraqi government and US troops in Iraq. 
Antagonistic sentiment messages were those in favor of the insurgents and against
the current Iraqi government and US forces. These messages had negative sentiments
about the Iraqi government and US troops. The occurrence of messages with
opposing sentiments is attributable to the presence of agitators (also referred to as
trolls) and debaters in these forums (Donath 1999; Herring et al. 2002; Viegas and
Smith 2004). Thus, while the majority of the forum membership may have negative
sentiments about a topic, a subset has opposing sentiment polarity. For the sake of
simplicity, from here on, we will refer to agonistic messages as “positive” and
antagonistic messages as “negative” as these terms are more commonly used to
represent the two sides in most previous sentiment analysis research. Here, we use
the terms positive and negative as indicators of semantic orientation with respect to
the specific topic; however, the “positive” messages may also contain sentiments
about other topics (which may be positive or negative) as described by Wiebe et al.
(2005). This is similar to the document-level annotations used for product and movie
reviews (Pang et al. 2002; Yi et al. 2003). Using two human annotators, 500 positive
(agonistic) and 500 negative (antagonistic) sentiment messages were incorporated
from each forum. Both annotators/coders were bilingual, fluent in English and
Arabic. The message annotation task by the independent coders had a kappa (κ)
value of 0.90 for English and 0.88 for Arabic, indicating sufficient intercoder
reliability. Table 10.5 shows some summary statistics for our English and Arabic
Web forum test bed.

Table 10.5 Characteristics of English and Arabic test bed

| Forum | Messages | Authors | Average length (char.) | Date range |
|---|---|---|---|---|
| US | 1,000 | 114 | 854 | 3/2004-9/2005 |
| Middle Eastern | 1,000 | 126 | 1,126 | 11/2005-3/2006 |

Table 10.6 Experiment 1a results

| Forum | Features | Accuracy (%) | Bootstrap (%) | Standard dev. | Number features |
|---|---|---|---|---|---|
| US | Stylistic | 71.40 | 71.07 | 3.324 | 867 |
| US | Syntactic | 87.00 | 87.13 | 2.439 | 12,014 |
| US | Stylistic + syntactic | 90.63 | 90.59 | 2.042 | 12,881 |
| Middle Eastern | Stylistic | 80.20 | 80.01 | 4.145 | 1,166 |
| Middle Eastern | Syntactic | 85.42 | 85.23 | 2.457 | 12,645 |
| Middle Eastern | Stylistic + syntactic | 90.81 | 90.69 | 2.093 | 13,811 |
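
The intercoder kappa values reported for the annotation task can be illustrated with a short sketch. The text does not say which kappa variant was used; the code below assumes Cohen's kappa for two coders, and the function name and the example labels are hypothetical.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both coders.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the coders' marginal label proportions.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    categories = set(counts_a) | set(counts_b)
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1 - expected)
```

For instance, if two coders agree on 95 of 100 messages and each assigns roughly half of them to each class, chance agreement is 0.5 and kappa is about 0.90.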



6.2 Experiment 1a: Evaluation of Features

In our first experiment, we repeated the feature set tests previously performed on the 
movie review dataset. The three permutations of stylistic and syntactic features 
were used. Table 10.6 shows the results for the three feature sets across the US and 
Middle Eastern forum message datasets.

Table 10.7 P values for pairwise t tests on accuracy (n = 50)

| Features/test bed | US | Middle Eastern |
|---|---|---|
| Sty. vs. syn. | <0.0001* | <0.0001* |
| Sty. vs. syn. + sty. | <0.0001* | <0.0001* |
| Syn. vs. syn. + sty. | <0.0001* | <0.0001* |

*P values significant at alpha = 0.05



The best classification accuracy results using SVM were achieved when using 
both syntactic and stylistic features. The combined feature set statistically outperformed
the use of only syntactic or stylistic features across both datasets. The increase
was more pronounced in the Middle Eastern forum messages, where the use of stylistic
and syntactic features resulted in a 5% improvement in accuracy over the use of
syntactic features alone. Surprisingly, stylistic features alone were able to attain over 
80% accuracy for the Middle Eastern messages, nearly a 9% improvement in the 
effectiveness of these features as compared to the English forum messages. This 
finding is consistent with previous stylometric analysis studies that have also found 
significant stylistic usage in Middle Eastern forums, including heavy usage of fonts, 
colors, elongation, numbers, and punctuation (Abbasi and Chen 2005). 

Table 10.7 shows the pairwise t tests conducted on the bootstrap samples to eval- 
uate the statistical significance of the improved results using stylistic and syntactic 
features. As expected, syntactic features outperformed stylistic features when both 
were used alone. However, using both feature categories significantly outperformed 
the use of either category individually. The results suggest that stylistic features are 
prevalent and important in Web discourse, even when applied to sentiment 
classification. 
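
The pairwise t tests on the bootstrap accuracies can be sketched as follows. This is an illustrative implementation of a standard paired t statistic, assuming the two feature sets are evaluated on the same 50 bootstrap samples; the function name is hypothetical and the chapter does not give its exact test procedure.

```python
import math

def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic for two accuracy lists measured on the same samples."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long lists with at least two values")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    # Sample variance of the differences (n - 1 in the denominator).
    variance = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    if variance == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean_diff / math.sqrt(variance / n)
```

With n = 50 samples there are 49 degrees of freedom, so under a two-tailed test at alpha = 0.05 an absolute t value above roughly 2.01 (the standard table value) would be treated as significant.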



6.3 Experiment 1b: Evaluation of Feature Selection Techniques

This experiment was concerned with evaluating the effectiveness of feature selection
for sentiment classification of Web forums. The same experimental settings as
experiment 1a were used for all techniques. Table 10.8 shows the results for the four
feature reduction methods and the no feature selection baseline applied across
the US and Middle Eastern forum messages. All four feature selection techniques
improved the classification accuracy over the baseline. The EWGA had the best
performance across both test beds in terms of overall accuracy, resulting in a 3-4%
improvement in accuracy over the no feature selection baseline. Furthermore,
the EWGA was also the most efficient in terms of the number of features used, 
improving accuracy while utilizing a smaller subset of the initial feature sets. 
EWGA-based feature selection was able to identify a more concise set of key fea- 
tures that was 50-70% smaller than IG and SVM weights (SVMW) and 75-90% 
smaller than the baseline. GA also used a smaller number of features; however, the 
use of EWGA resulted in considerably improved accuracy.

Table 10.8 Experiment 1b results

| Forum | Technique | Tenfold CV (%) | Bootstrap (%) | Standard dev. | Number features |
|---|---|---|---|---|---|
| US | Base | 90.61 | 90.56 | 1.831 | 12,881 |
| US | IG | 92.22 | 92.10 | 1.612 | 1,057 |
| US | GA | 91.83 | 91.64 | 1.396 | 511 |
| US | SVMW | 92.33 | 92.28 | 1.512 | 1,000 |
| US | EWGA | 94.72 | 94.94 | 1.671 | 502 |
| Middle Eastern | Base | 90.79 | 90.57 | 1.932 | 13,811 |
| Middle Eastern | IG | 93.41 | 93.38 | 1.665 | 1,045 |
| Middle Eastern | GA | 92.14 | 92.24 | 1.438 | 462 |
| Middle Eastern | SVMW | 93.28 | 93.26 | 1.337 | 1,000 |
| Middle Eastern | EWGA | 93.62 | 93.84 | 2.831 | 338 |

Table 10.9 P values for pairwise t tests on accuracy (n = 50)

| Technique/test bed | US | Middle Eastern |
|---|---|---|
| Base vs. IG | <0.0001* | <0.0001* |
| Base vs. GA | 0.0001* | 0.0134* |
| Base vs. EWGA | <0.0001* | <0.0001* |
| Base vs. SVMW | <0.0001* | <0.0001* |
| IG vs. GA | 0.0356* | 0.0685 |
| IG vs. EWGA | <0.0001* | 0.2783 |
| IG vs. SVMW | 0.2934 | 0.4130 |
| GA vs. EWGA | <0.0001* | 0.0456* |
| GA vs. SVMW | 0.0279* | 0.0728 |
| SVMW vs. EWGA | <0.0001* | 0.2025 |

*P values significant at alpha = 0.05



Table 10.9 shows the pairwise t tests conducted on the bootstrap values to evalu- 
ate the statistical significance of the improved results using feature selection. 

EWGA significantly outperformed the baseline and GA on both datasets. In
addition, EWGA provided significantly better performance than IG and SVMW on 
the English Web forum messages. EWGA also outperformed IG and SVMW on the 
Middle Eastern forum dataset, though the improved performance was not statisti- 
cally significant. 
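
As a reminder of what the IG technique compared above measures, the following sketch computes the information gain of a single binary feature (present or absent in a message) with respect to the sentiment labels. It uses the standard textbook definition, which may differ in detail from the formulation used in this study; the function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_present, labels):
    """Reduction in class entropy obtained by splitting on a binary feature."""
    n = len(labels)
    gain = entropy(labels)
    for value in (True, False):
        subset = [l for f, l in zip(feature_present, labels) if bool(f) == value]
        if subset:
            gain -= (len(subset) / n) * entropy(subset)
    return gain
```

A feature that appears only in one sentiment class of a balanced two-class set has a gain of 1 bit, while a feature spread evenly across both classes has a gain of 0. Ranking features by this score and keeping the top-scoring ones is one common way to obtain a reduced feature set.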



6.4 Results Discussion 

Fig. 10.4 shows the selection accuracy and number of features selected (out of over
12,800 potential features) for the US forum using EWGA as compared to GA across
the 200 iterations (averaged over the ten folds). The Middle Eastern forum graphs looked
similar to the US forum and were hence not included. The EWGA accuracy declines
initially despite being seeded with the IG solution. This is due to the use of generation
replacement, which prevents the IG solution from dominating the other solutions
and creating a stagnant solution population. As intended, the IG solution features are
gradually disseminated to the remaining solutions in the population until the new
solutions begin to improve in accuracy around the 20th iteration. Overall, the EWGA
is able to converge on an improved solution while only using half of the features
originally transferred from IG. It is interesting to note that EWGA and GA both
converge to a similar number of features when applied to the US forum; however,
the EWGA is better able to isolate the more effective sentiment discriminators.

Fig. 10.4 US forum results using EWGA and GA (both panels plotted against iteration)
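
The behavior described above (a population seeded with one heuristic solution and fully replaced at every generation) can be illustrated with a plain genetic algorithm for feature subset selection. This is a hedged sketch, not the EWGA: the entropy-weighted operators are not reproduced, and the function name, population size, crossover and mutation rates, and roulette-wheel selection are all assumptions made for illustration. The fitness function is supplied by the caller and must return non-negative values (for example, a classification accuracy).

```python
import random

def run_seeded_ga(fitness, n_features, seed_solution, pop_size=20,
                  generations=50, crossover_rate=0.6, mutation_rate=0.01,
                  rng_seed=0):
    """Return the best 0/1 feature mask seen by a GA with generation replacement."""
    rng = random.Random(rng_seed)
    # One individual is the seed (e.g., an IG-selected subset); the rest are random.
    population = [list(seed_solution)]
    population += [[rng.randint(0, 1) for _ in range(n_features)]
                   for _ in range(pop_size - 1)]
    best = max(population, key=fitness)
    for _ in range(generations):
        # Roulette-wheel weights; the small constant avoids an all-zero wheel.
        weights = [fitness(ind) + 1e-9 for ind in population]
        next_gen = []
        while len(next_gen) < pop_size:
            mom, dad = rng.choices(population, weights=weights, k=2)
            child_a, child_b = mom[:], dad[:]
            if rng.random() < crossover_rate:
                point = rng.randrange(1, n_features)  # single-point crossover
                child_a = mom[:point] + dad[point:]
                child_b = dad[:point] + mom[point:]
            for child in (child_a, child_b):
                for i in range(n_features):
                    if rng.random() < mutation_rate:
                        child[i] = 1 - child[i]  # bit-flip mutation
                next_gen.append(child)
        # Generation replacement: no parent survives unchanged by design.
        population = next_gen[:pop_size]
        candidate = max(population, key=fitness)
        if fitness(candidate) > fitness(best):
            best = candidate
    return best
```

Because no individual is carried over automatically, the seeded solution can be diluted in early generations, which is consistent with the initial accuracy decline noted in the text; the sketch therefore tracks the best mask seen so far separately from the current population.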



6.4.1 Analysis of Key Sentiment Features 

We chose to analyze the EWGA features since they provided the highest perfor- 
mance with the most concise set of features. Thus, the EWGA-selected features are 
likely to be the most significant discriminators with the least redundancy. Figure 10.5 
shows the number of each feature category selected by the EWGA for the English 
and Arabic feature set. As expected, more syntactic features (POS tags, n-grams, 
word roots) were used since considerably more of these features were included. 

While Fig. 10.5 shows the number of features selected by the EWGA for each
feature category, Fig. 10.6 shows the percentage of the overall number of features in
each category that were selected. For example, the EWGA selected 12 structural
features from the US (English) feature set; however, this represents 86% of
the structural features, as shown in Fig. 10.6.

Looking at the usage percentage, stylistic features were more efficient than word
n-grams and POS tags/roots. Many of the stylistic feature groups had over 40% 
usage whereas syntactic features rarely had such high usage with the exception of 
punctuation. For the US feature set, some categories such as word length, vocabu- 
lary richness, special characters, and structural features had well over 80% repre- 
sentation in the final feature subset. Comparing across regions, US features had 
higher usage rates than the Middle Eastern feature set. Approximately 10% of the 
Middle Eastern features were used by the EWGA versus 25% of the US attributes.

Fig. 10.5 Key feature usage frequencies by category

Fig. 10.6 Key feature usage percentage by category



6.4.2 Key Stylistic Features 

Figure 10.7 shows some of the important stylistic features for the US forum. The 
diagram to the left shows the normalized average feature usage across all positive 
and negative sentiment messages. The table to the right shows the description for 
each feature as well as its IG and SVM weight. 

| Feature # | Feature | IG | SVM |
|---|---|---|---|
| 1 | total char. | 0.027 | 0.243 |
| 2 | $ | 0.029 | 0.130 |
| 3 | & | 0.017 | 0.141 |
| 4 | { | 0.012 | 0.126 |
| 5 | digit count | 0.015 | 0.316 |
| 6 | therefore | 0.021 | -0.104 |
| 7 | however | 0.017 | -0.120 |
| 8 | nevertheless | 0.014 | -0.119 |

Fig. 10.7 Key stylistic features for US forum

The positive sentiment messages (agonists, in favor of racial diversity) tend to be
considerably shorter (feat. 1), containing a few long sentences. These messages also
feature heavier usage of conjunctive function words such as “however,” “therefore,”
and “nevertheless” (feat. 6-8). In contrast, the negative sentiment messages are
nearly twice as long and contain lots of digits (feat. 5) and special characters (feat.
2-4). Higher digit usage in the negative messages is due to references to news articles
used to stereotype. Article snippets begin with a date, resulting in the higher
digit count. The negative messages also feature shorter sentences. The stylistic feature
usage statistics suggest that the positive sentiment messages follow more of a
debating style with shorter, well-structured arguments. In contrast, the negative sentiment
messages tend to contain greater signs of emotion. The following verbal
joust between two members in the US forum exemplifies the stylistic differences
across sentiment classes. It should be noted that some of the content in the messages
has been sanitized for vulgar word usage; however, the stylistic tendencies that are
meant to be illustrated remain unchanged.

Negative 

You are a total %#$*@ idiot!!! You walk around thinking you’re doing humanity a
favor, sympathizing with such barbaric slime. They use your sympathy as an excuse
to fail. They are a burden to us all!!! Your opinion means nothing.

Positive 

Neither does yours. But at least my opinion is an educated and informed one backed 
by well-reasoned arguments and careful skepticism about my assumptions. Race is 
nothing more than a social classification. What have you done for society that allows 
you to deem others a burden? 
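
A few of the stylistic markers discussed for the US forum (message length, digits, special characters, and conjunctive function words) can be counted with a short sketch. The marker list below is a small hypothetical subset chosen for illustration; the full stylistic feature set used in the study is much larger, and the function and key names are assumptions.

```python
import string

# Conjunctive function words named in the text as positive-class markers.
CONJUNCTIVE_WORDS = ("however", "therefore", "nevertheless")
# A hypothetical subset of special characters to count.
SPECIAL_CHARS = "$&{%#*@"

def style_markers(message):
    """Count a handful of simple stylistic markers in one message."""
    words = [w.strip(string.punctuation).lower() for w in message.split()]
    return {
        "total_chars": len(message),
        "digit_count": sum(ch.isdigit() for ch in message),
        "special_chars": sum(ch in SPECIAL_CHARS for ch in message),
        "exclamations": message.count("!"),
        "conjunctive_words": sum(w in CONJUNCTIVE_WORDS for w in words),
    }
```

Applied to excerpts from the exchange quoted above, the negative message yields several special characters and exclamation marks, while the positive reply yields none, in line with the tendencies described for the two classes.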

Figure 10.8 shows some of the important stylistic features for the Middle Eastern 
forum. 

There are a few interesting similarities between the US and Middle Eastern forum 
feature usage tendencies across sentiment lines. The positive sentiment messages in 
the Middle Eastern forum (agonists, opposed to the insurgency) also tend to be con- 
siderably shorter than the negative sentiment messages in terms of total number of 
characters (feat. 2). Additionally, like their US forum counterparts, the negative 
Arabic messages contain heavy digit usage attributable to news article snippets (feat. 
5). The negative sentiment messages make greater use of stylistic word stretching 
(elongation) which is done in order to emphasize key words (feat. 3). Consequently, 
the negative messages include greater use of words longer than 10 characters (feat. 
4) while the positive messages are more likely to use shorter words, less than 4 characters
in length (feat. 1). The negative sentiment messages also have higher vocabulary
richness (feat. 6-9, various vocabulary richness formulas).

| Feature # | Feature | IG | SVM |
|---|---|---|---|
| 1 | short words | 0.026 | -0.020 |
| 2 | total char. | 0.039 | 0.210 |
| 3 | elongation | 0.037 | 0.319 |
| 4 | long words | 0.035 | 0.362 |
| 5 | digits | 0.014 | 0.086 |
| 6 | Simpson | 0.029 | 0.135 |
| 7 | Yule | 0.023 | 0.137 |
| 8 | Brunet | 0.027 | 0.109 |
| 9 | Honore | 0.024 | 0.140 |

Fig. 10.8 Key stylistic features for Middle Eastern forum (chart panel title: Middle Eastern Forum Usage)
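
The word-length, elongation, and vocabulary richness markers listed for the Middle Eastern forum can be sketched as follows. The chapter does not give the formulas, so the code assumes commonly cited forms of Simpson's D, Yule's K, Brunet's W (with exponent 0.172), and Honore's R; the elongation check (any character repeated three or more times) is a simplification, since Arabic elongation is often marked with a dedicated stretching character. All names are hypothetical.

```python
import math
import re
from collections import Counter

def word_length_markers(tokens):
    """Count short words (< 4 chars), long words (> 10 chars), and elongated words."""
    return {
        "short_words": sum(len(t) < 4 for t in tokens),
        "long_words": sum(len(t) > 10 for t in tokens),
        "elongated_words": sum(bool(re.search(r"(\w)\1{2,}", t)) for t in tokens),
    }

def vocabulary_richness(tokens, brunet_a=0.172):
    """Return assumed forms of four richness measures; needs at least two tokens."""
    n = len(tokens)
    if n < 2:
        raise ValueError("need at least two tokens")
    type_counts = Counter(tokens)
    v = len(type_counts)                            # number of distinct words
    spectrum = Counter(type_counts.values())        # i -> number of words used i times
    hapaxes = spectrum.get(1, 0)                    # words used exactly once
    simpson = sum(vi * (i / n) * ((i - 1) / (n - 1)) for i, vi in spectrum.items())
    yule_k = 1e4 * (sum(i * i * vi for i, vi in spectrum.items()) - n) / (n * n)
    brunet_w = n ** (v ** -brunet_a)
    # Honore's R is undefined when every word is a hapax; report infinity then.
    honore_r = 100 * math.log(n) / (1 - hapaxes / v) if hapaxes < v else float("inf")
    return {"simpson": simpson, "yule": yule_k, "brunet": brunet_w, "honore": honore_r}
```

Each of these values would be one entry in a message's stylistic feature vector, alongside the character-level counts.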



Table 10.10 Key n-grams for various sentiment classes

| US forum: positive (agonist) | US forum: negative (antagonist) | Middle Eastern forum: positive (agonist) | Middle Eastern forum: negative (antagonist) |
|---|---|---|---|
| Racist terms | Racist terms | Racist Shia terms | Racist Sunni terms |
| “Racism” | “Criminals” | “Terrorists” | “Freedom fighters” |
| “Subhuman racist” | “Whites” | “Shia” | “Martyrdom” |
| “Anti-Semitism” | “Americans” | “Shiite” | “Zarqawi” |
| “Ignorant slime” | “Get a job” | | “Sunni” |
| | “Odin's rage” | | “American” |
| | “Urban jungle” | | “Iraq” |
| | | | “International forces” |

6.4.3 Key Syntactic Features 

Table 10.10 shows the key word n-grams for each sentiment class selected by the 
EWGA. Many of the terms and phrases were racist content that was not included 
in the table but rather represented using a description label. Items in quotes indicate 
actual terms (e.g., “criminals”) while nonquoted items signify term descriptions 
(e.g., racist terms). For the Middle Eastern forum, sentiments seem to be drawn 
along sectarian lines. In contrast, US forum sentiments are not clearly separated 
along racial lines. While the majority of the negative sentiments toward racial 
issues are generated by white supremacists, many of the positive sentiments are 
also presented by those with the same self-proclaimed racial affiliations. This 
reduced the amount of racial name calling across sentiments in the US forums, 
resulting in the need for considerably larger numbers of n-grams to effectively 
discern sentiment classes. Consequently, the number of n-grams used for the US 
feature set (332) is nearly threefold those used for the Middle Eastern sentiment 
classification (117).



7 Conclusions and Future Directions

In this study, we applied sentiment classification methodologies to English and 
Arabic Web forum postings. In addition to syntactic features, a wide array of English 
and Arabic stylistic attributes including lexical, structural, and function word style 
markers were included. We also developed the entropy weighted genetic algorithm 
(EWGA) for efficient feature selection in order to improve accuracy and identify 
key features for each sentiment class. EWGA significantly outperformed the no
feature selection baseline and GA on all test beds. It also outperformed IG and
SVMW on all three datasets (statistically significant for the movie review and US 
forum datasets) while isolating a smaller subset of key features. EWGA demon- 
strated the utility of these key features in terms of classification performance and for 
content analysis. Analysis of EWGA-selected stylistic and syntactic features 
allowed greater insight into writing style and content differences across sentiment 
classes in the two Web forums. Our approach of using stylistic and syntactic fea- 
tures in conjunction with the EWGA feature selection method achieved a high level 
of accuracy, suggesting that these features and techniques may be used in the future 
to perform sentiment classification and content analysis of Web forum discourse. 
Applying sentiment analysis to Web forums is an important endeavor, and the cur- 
rent accuracy is promising for effective analysis of forum conversation sentiments. 
Such analysis can help provide a better understanding of extremist group usage of 
the Web for information and propaganda dissemination. 

In the future, we would like to evaluate the effectiveness of the proposed senti- 
ment classification features and techniques for other tasks such as sentence- and 
phrase-level sentiment classification. We also intend to apply the technique to other 
sentiment domains (e.g., news articles and product reviews). Moreover, we believe 
the suggested feature selection technique may also be appropriate for other forms of 
text categorization and plan to apply our technique to topic, style, and genre classification.
We also plan to investigate the effectiveness of other forms of GA hybridization,
such as using the SVM weights instead of the IG heuristic.



Chapter 11

Affect Analysis 



1 Introduction 

The need for enhanced information retrieval and knowledge discovery from 
computer-mediated communication (CMC) archives has been articulated by many 
individuals in recent years. One suggested information access refinement has been 
to mine directional text: text containing emotions and opinions (Hearst 1992; Wiebe 
1994). Affects play an important role in influencing people’s perceptions and 
decision making (Picard 1997). Analysis of sentiment and affects is particularly 
important for online discourse, where such information is often more pervasive than 
topical content (Subasic and Huettner 2001; Nigam and Hurst 2004). With the 
increased popularity of social computing, the presence and significance of affective 
text is likely to grow (Liu et al. 2003). There has been considerable recent work on 
sentiment analysis of online forums and product reviews (Turney and Littman 2003; 
Wiebe et al. 2004). However, research on analysis of affects (including emotions 
and moods) is still relatively sparse (Cho and Lee 2006). While recent studies have 
analyzed the presence of affects in blogs, online stories, chat dialog, transcripts, 
song lyrics, etc., it is unclear which features and techniques are most useful for 
affective computing of online texts. There is therefore a need to compare existing 
features for representing affective content as well as the techniques used for assigning 
emotive intensities. 

In this chapter, we compare features and techniques for classification of affective 
intensities in online text. The features investigated include a large set of learned 
n-grams as well as automatically and manually generated affect lexicons used in 
prior research. We also propose a support vector regression correlation ensemble 
(SVRCE) method for text-based affect classification. SVRCE combines feature 
subset ensembles with affect correlation information for improved affect classification 
performance. Evaluation of the various feature representations and the proposed 
method in comparison with existing affect analysis techniques found that the use of 
SVRCE with n-grams is highly effective for affect classification of online forums, 
blogs, and stories.

The remainder of this chapter is organized as follows: Section 2 provides a
review of related work on textual affect analysis. Section 3 outlines our research 
framework based on gaps and questions derived from the literature review. Section 4 
presents an experimental evaluation of the various features and techniques incorpo- 
rated in our framework. Section 5 features a brief case study illustrating how the 
proposed affect analysis methods can be applied to large CMC archives. Section 6 
contains concluding remarks and describes future research directions. 



2 Related Work 

Affect analysis is concerned with the analysis of text containing emotions (Picard 
1997; Subasic and Huettner 2001). Emotional intelligence, the ability to effectively 
recognize emotions automatically, is crucial for learning preference-related information
and determining the importance of particular content (Picard et al. 2001). Affect
analysis is associated with sentiment analysis, which looks at the directionality of text, 
i.e., whether a text segment is positively or negatively oriented (Hearst 1992). However,
there are two major differences between affect analysis and sentiment analysis. First, 
affect analysis involves a large number of potential emotions or affect classes (Subasic 
and Huettner 2001). These include happiness, sadness, anger, hate, violence, excite- 
ment, fear, etc. In contrast, sentiment analysis primarily deals with positive, negative, 
and neutral sentiment polarities. Second, while the sentiments associated with particu- 
lar words or phrases are mutually exclusive, text segments can contain multiple affects 
(Subasic and Huettner 2001; Grefenstette et al. 2004b). For example, the sentence 
“I can’t stand you!” has only a negative sentiment polarity but simultaneously con- 
tains hate and anger affects. Word-level examples include the verb form of “alarm,” 
which can be attributed to fear, warning, and excitement affects (Subasic and Huettner 
2001), and the adjective “gleeful,” which can be assigned to the happiness and excite- 
ment affect classes (Grefenstette et al. 2004b). Additionally, certain affect classes may 
be correlated (Subasic and Huettner 2001). For instance, hate and anger often co- 
occur in text segments, resulting in a positive correlation. Similarly, happiness and 
sadness are opposing affects that are likely to have a negative correlation. In summary, 
affect analysis involves assigning text with emotive intensities across a set of mutu- 
ally inclusive and possibly correlated affect classes. Important affect analysis charac- 
teristics include the features used to represent the presence of affects in text, techniques 
for assigning affective intensity scores, and the level of text granularity at which the 
analysis is performed. Table 11.1 presents a summary of the relevant prior studies 
based on these important affect analysis characteristics. 

Table 11.1 Related prior affect analysis studies

| Study | Features | Technique(s) | Analysis level | Test bed and results |
|---|---|---|---|---|
| Donath et al. (1999) | Manual lexicon, punctuation | Posting scoring | Posting | Greek Usenet forums; visualization of anger intensities over time |
| Subasic and Huettner (2001) | Manual lexicon (fuzzy semantic typing) | Word scoring | Word | Movie reviews and news stories; visualization of 83 affects |
| Liu et al. (2003) | Language patterns derived from knowledge base | Sentence scoring | Sentence | User study on e-mail browser |
| Chuang and Wu (2004) | Manual lexicon | Support vector machine (SVM) | Sentence | Drama broadcast transcripts; 76.44% accuracy for 7 class experiments |
| Grefenstette et al. (2004a) | Manual lexicon, semantic orientation | Manual tagging, pointwise mutual information (PMI) | Word | Candidate affect words; scored intensities across 86 affects |
| Grefenstette et al. (2004b) | Manual lexicon | Word scoring | Word | Political web pages; scored text relating to certain topic |
| Read (2004) | Semantic orientation | Pointwise mutual information (PMI) | Sentence | Short stories; 47.14% accuracy for 2 class experiments |
| Ma et al. (2005) | Manual lexicon (WordNet-Affect database) | Word scoring | Sentence | Instant messaging chat data; no formal evaluation |
| Mishne (2005) | BOWs, POS tags, document length, emphasized words, semantic orientation, WordNet lexicon | Support vector machine (SVM) | Posting | Live Journal blog postings; 60.25% accuracy for 2 class experiments |
| Cho and Lee (2006) | Manual lexicon, BOWs | Sentence scoring, support vector machine (SVM) | Song | Korean song lyrics; 77.3% accuracy on 5 class experiments |
| Mishne and Rijke (2006) | Word n-grams | Pace regression | Posting | Live Journal blog postings; average error of 52.53%, correlation coefficient of 0.827 for 2 class experiments |
| Wu et al. (2006) | Emotion generation and association rules | Separable mixture models | Posting | Student chat dialog; 80.98% accuracy for 3 class experiments |

Based on the table, we can make several observations regarding the features and
techniques used in previous affect analysis research.

1. Most prior research has used either manually generated lexicons, lexicons
automatically created using WordNet or semantic orientation, or generic feature
representations such as word and part-of-speech tag n-grams. It is unclear which
of these feature representations is most effective for affect analysis.

2. Techniques used for assigning affect intensities can be predominantly categorized
into scoring methods and machine learning techniques. However, we are unaware of 
any prior work attempting to compare various techniques for affect classification. 

3. Previous affect classification studies typically utilized between two and seven 
affect classes, applied at the word, sentence, or document levels. Despite the 
presence of multiple interrelated affects (Subasic and Huettner 2001; Grefenstette
et al. 2004b), class correlation information was not leveraged for improved affect 
intensity assignment. Additionally, regression-based methods have seen limited 
usage despite their effectiveness in related application domains (Pang and Lee 
2005; Schumaker and Chen 2006). 

4. Prior studies mainly focused on a single application domain, such as movie
reviews, web forums, blogs, chat dialog, song lyrics, stories, etc. Given the differences
in the degree of interaction, language usage, and communication structure
across these domains, it is unclear if an approach suitable for classifying story
affects will be applicable on web forums and blogs.

The features and techniques used in prior affect analysis research are expounded
upon in the remainder of the section.



2.1 Features for Affect Analysis 

The attributes used to represent affects can be classified into lexicon-based features 
and generic n-gram-based features. Considerable prior research has used manually 
or automatically generated lexicons. As previously stated, in affect lexicons, the 
same word/phrase can be assigned to multiple affect classes. The intensity score for 
an attribute is based on its degree of severity toward that particular affect class. 
Depending upon the semantic relation between affects, certain classes can have a 
positive or negative occurrence correlation (Subasic and Huettner 2001). 
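
The notion of occurrence correlation between affect classes can be made concrete with a Pearson correlation over per-document intensity scores. This is only an illustrative sketch: the function name is hypothetical, and the choice of Pearson correlation is an assumption rather than a method stated in the works cited above.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equally long lists of intensity scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long lists with at least two values")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant series carries no correlation information
    return cov / (sd_x * sd_y)
```

With made-up intensities for four documents, say hate = [0.8, 0.1, 0.6, 0.0] and anger = [0.9, 0.2, 0.5, 0.1], the coefficient is positive, whereas happiness = [0.1, 0.9, 0.2, 0.8] against sadness = [0.7, 0.0, 0.6, 0.1] gives a negative value.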

Many studies have incorporated manually developed affect lexicons. Subasic 
and Huettner (2001) used fuzzy semantic typing where each feature was assigned to
multiple affect categories with varying intensity and centrality scores depending
upon the word and usage context. For example, the word “rat” was assigned to the
disloyalty, horror, and repulsion affect categories with intensity scores of 0.9, 0.6,
and 0.7, respectively (on a 0.0-1.0 scale where 1.0 was highest). In order to compensate
for word-sense ambiguity, their approach also assigned each word-affect pair a
centrality score indicating the likelihood of the word being used for that particular
affect class. For example, the word “rat” was assigned a centrality score of 0.3 for
the disloyalty affect and 0.6 for the repulsion affect (also on a 0.0-1.0 scale) since
the usage of “rat” to convey disloyalty is not as common. Thus, while “rat” was 
more intense for the disloyalty affect, it was also less central to this class. In Subasic 
and Huettner’s (2001) approach, the intensity and centrality scores were both utilized
for determining the affective composition of a text document. Although the
scores assigned to specific terms may be inaccurate, the fuzzy logic approach is
intended to capture the essence of a document’s various affect intensities. A similar
method for generating manual lexicons was employed in related work (Grefenstette 
et al. 2004a, b). Many other studies have also utilized manually constructed affect 
lexicons (Chuang and Wu 2004; Cho and Lee 2006). Donath et al. (1999) used a set 
of keywords relating to anger for analyzing Usenet forums. Ma et al. (2005) incor- 
porated the WordNet-Affect database created by Valitutti et al. (2004). This database 
is comprised of manually assigned affect intensities for words found in the WordNet 
lexical resource (Fellbaum 1 998). Liu et al. (2003) manually constructed sentence- 
level language patterns for identification of six affect classes, including happiness, 
sadness, anger, fear, etc. 

Although manually created affect lexicons can provide powerful insight, their
construction can be time consuming and tedious. As a result, many studies have
explored the use of automated lexicon generation methods such as semantic orientation
(Grefenstette et al. 2004a; Read 2004; Mishne 2005) and WordNet lexicons
(Mishne 2005). These methods take a small set of manually generated seed/paradigm
words which accurately reflect the particular affect class and use automated
methods for lexicon expansion and candidate word scoring.

Based on the work of Turney and Littman (2003), the semantic orientation 
approach assesses the intensity of each word based on its frequency of co-occurrence 
with a set of core paradigm words reflective of that affect class (Grefenstette et al. 
2004a). The occurrence frequencies for the paradigm words and candidate words 
are derived from search engines such as AltaVista (Grefenstette et al. 2004a; Read 
2004; Mishne 2005) or Yahoo! (Mishne 2005). The number of paradigm words used 
for a particular affect class is generally five to seven (Grefenstette et al. 2004a; Read 
2004). For example, the paradigm words for the praise affect may include “acclaim, 
praise, congratulations, homage, approval” (Grefenstette et al. 2004a), and addi- 
tional lexicon items generated automatically using semantic orientation include the 
words “award, honor, extol.” The semantic orientation approach is typically coupled 
with a pointwise mutual information (PMI) scoring mechanism for assigning candi- 
date words intensity scores (Turney and Littman 2003). Traditional PMI assigns 
each word a score based on how often it occurs in proximity with positive and 
negative paradigm words; however, it has been modified to be applicable with affect 
classes (Read 2004; Grefenstette et al. 2004a). The affect analysis rendition of PMI 
proposed by Grefenstette et al. (2004a) is as follows: 



\[
\mathrm{PMI\ Score}(word, Class) = \log_2 \left( \frac{\prod_{cword \in Class} hits(word\ \mathrm{NEAR}\ cword)}{\prod_{cword \in Class} \log_2\left(hits(cword)\right)} \right)
\]

where cword is one of the paradigm words chosen for an affect class Class and hits
is the number of pages found by AltaVista.
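The scoring step can be illustrated with a minimal Python sketch. It assumes the hit counts have already been retrieved and are passed in as plain dictionaries; the function name, argument names, and the counts in the usage note are hypothetical. The sketch works in log space and assumes every near-hit count is positive and every paradigm word has more than one hit, since the formula is undefined otherwise.

```python
import math


def pmi_score(near_hits, paradigm_hits):
    """Affect-class PMI score for one candidate word.

    near_hits[cword]     -> hits(word NEAR cword), assumed > 0
    paradigm_hits[cword] -> hits(cword), assumed > 1
    """
    log_numerator = 0.0
    log_denominator = 0.0
    for cword, hits in paradigm_hits.items():
        # log2 of the product of near-hit counts
        log_numerator += math.log2(near_hits[cword])
        # log2 of the product of log2(hits(cword)) terms
        log_denominator += math.log2(math.log2(hits))
    return log_numerator - log_denominator
```

For instance, with made-up counts `near_hits = {"praise": 8, "acclaim": 4}` and `paradigm_hits = {"praise": 16, "acclaim": 4}`, the numerator product is 32, the denominator product is 4 × 2 = 8, and the score is log2(32/8) = 2.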

Another automated affect lexicon generation method is WordNet lexicons. 
Originally proposed by Kim and Hovy (2004), this method is similar to semantic 
orientation. However, it uses WordNet to expand the seed words associated with a
particular affect class by comparing each candidate word’s synset with the seed
word list (Mishne 2005). The intensity for a candidate word is proportional to the
percentage of its synset also present in the seed word list for that particular affect
class. Word scores are assigned using the following formula (Kim and Hovy 2004):

\[
\mathrm{WordNet\ Score}(word, Class) = P(Class) \cdot \frac{\sum_{i=1}^{n} count(syn_i, Class)}{count(Class)}
\]

where Class is an affect class, syn_i is one of the n synonyms of word, and P(Class)
is the number of words in Class divided by the total number of words considered.
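A small Python sketch of this score follows. It is illustrative only: the synonym list is supplied by the caller rather than looked up in WordNet, count(Class) is assumed to be the size of the seed list (the text does not define it explicitly), and all names and example words are hypothetical.

```python
def wordnet_score(synonyms, seed_words, total_words):
    """P(Class) * sum_i count(syn_i, Class) / count(Class).

    synonyms    -- the n synonyms of the candidate word
    seed_words  -- seed word list for the affect class
    total_words -- total number of words considered across all classes
    """
    class_size = len(seed_words)          # assumed meaning of count(Class)
    p_class = class_size / total_words    # P(Class) as defined in the text
    # count(syn_i, Class): occurrences of each synonym in the seed list
    matches = sum(seed_words.count(syn) for syn in synonyms)
    return p_class * matches / class_size
```

With a four-word seed list, 20 words considered overall, and a candidate whose three synonyms include two seed words, the score is 0.2 × 2 / 4 = 0.1.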

In addition to lexicon-based affect representations, studies have also used generic 
n-gram features. Mishne (2005) used bag-of-words (BOWs) and part-of-speech 
(POS) tags in combination with automatically generated lexicons, while Mishne 
and Rijke (2006) used word n-grams for affect analysis of blog postings. Cho and 
Lee (2006) used BOWs for classifying affects inherent in Korean song lyrics. 
N-grams have also been shown to be highly effective in the related area of sentiment 
classification (Wiebe et al. 2004; Abbasi et al. 2008), especially when combined 
with machine learning methods capable of learning n-gram patterns conveying 
opinions and emotions. While prior research has used various n-gram and lexicon 
representations, we are unaware of any work done to evaluate the effectiveness of 
various potential affect analysis features. 



2.2 Techniques for Assigning Affect Intensities 

Prior research has utilized scoring and machine learning methods for assigning affect 
intensities. Scoring-based methods, which are generally used in conjunction with 
lexicons, typically use the average intensity across lexicon items occurring in the text 
(i.e., word spotting) (Subasic and Huettner 2001; Liu et al. 2003; Cho and Lee 2006). 
Sentence-level averaging has also been performed in combination with the word-level 
PMI scores generated using semantic orientation (Turney and Littman 2003) as well 
as with WordNet lexicons (Kim and Hovy 2004). Studies that directly developed
lexicons comprised of sentence patterns do not use averaging (at least at the
sentence level) but instead simply match sentences with lexicon entries and assign
intensity scores accordingly (Liu et al. 2003; Cho and Lee 2006).
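Word spotting as described here can be sketched in a few lines of Python. The tokenization (whitespace split with punctuation stripped), the zero score for sentences without lexicon matches, and the example lexicon values are all assumptions made for illustration; they are not taken from the cited studies.

```python
import string


def word_spotting_score(sentence, lexicon):
    """Average intensity of the lexicon items that occur in a sentence."""
    tokens = [tok.strip(string.punctuation).lower() for tok in sentence.split()]
    matched = [lexicon[tok] for tok in tokens if tok in lexicon]
    if not matched:
        return 0.0  # assumption: no lexicon item means no affect detected
    return sum(matched) / len(matched)
```

Given a hypothetical lexicon `{"rage": 0.9, "annoyed": 0.3}`, the sentence "She was annoyed, then full of rage." matches two items and receives the average of 0.3 and 0.9.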

Machine learning techniques have also been used for assigning affect intensities. 
Many studies used support vector machine (SVM) for determining whether a text 
segment contained a particular affect class (Chuang and Wu 2004; Mishne 2005; 
Cho and Lee 2006). One shortcoming of using SVM is that it can only deal with 
discrete class labels, whereas affect intensities can vary along a continuum. Recent 
work has attempted to address this problem by using regression-based classifiers 
(Pang and Lee 2005). For example, Mishne and Rijke (2006) used word n-grams in 
unison with Pace regression (Witten and Frank 2005) for assigning affect intensities
in LiveJournal blogs. Nevertheless, regression-based learning methods have seen
limited usage despite their effectiveness in related application domains such as 
using news story text for stock price prediction (Schumaker and Chen 2006). 
Furthermore, although scoring and machine learning methods have been utilized for 
classifying affect intensities, there has been no research done to investigate the 
effectiveness of these methods. 



3 Research Design 

In this section, we highlight affect analysis research gaps based on our review of the 
related work. Research questions are then posed based on the relevant gaps identified. 
Finally, a research framework is presented in order to address these research 
questions, along with some research hypotheses. The framework encompasses 
various feature representations and techniques for assigning affective intensities 
to sentences. 



3.1 Gaps and Questions 

Prior research has utilized manually or automatically generated lexicons as well as 
generic n-gram features for representing affective content in text. Since most studies 
used a single feature category and did not compare different alternatives, it is unclear 
which emotive representation is most effective. Furthermore, prior research has 
used scoring-based techniques and machine learning methods such as SVM. 
Regression-based methods capable of assigning continuous intensity scores have 
not been explored in great detail, with the exception of Mishne and Rijke (2006). 
Leveraging the relationship between mutually inclusive affect classes in combina- 
tion with powerful regression-based machine learning methods such as support 
vector regression (SVR) could be highly effective for accurate assignment of affect 
intensities. Additionally, most prior affect analysis research was applied to a single 
domain (e.g., blogs, stories, etc.). Application across multiple domains could lend 
greater validity to the effectiveness of affect analysis features and techniques. Based 
on these gaps, we present the following research questions: 

• Which feature categories are best at accurately assigning affect intensities? 

- Can the use of an extended feature set enhance affect analysis performance 
over individual generic and lexicon-based feature categories? 

• Can a regression ensemble that incorporates affect correlation information 
outperform existing machine learning and scoring-based methods? 

• What impact will the application domain have on affect intensity assignment?

Fig. 11.1 Affect analysis research framework



3.2 Research Framework 

Our research framework (shown in Fig. 11.1) relates to the features and techniques 
used for assigning affect intensity scores. 

We intend to compare generic n-gram features with automatically and manually 
generated lexicons. We also plan to assess the effectiveness of using an extended 
feature set encompassing all these attributes in comparison with individual feature 
categories. With respect to affect analysis techniques, we propose a support vector 
regression (SVR) ensemble that considers affect correlation information when assigning
emotive intensities to sentences. We intend to compare the SVR correlation
ensemble (SVRCE) with other machine learning and scoring-based methods used in 
prior research. These include Pace regression (Witten and Frank 2005; Mishne and 
Rijke 2006), semantic orientation (Grefenstette et al. 2004a; Read 2004), WordNet 
(Kim and Hovy 2004), and manual lexicon scoring (Subasic and Huettner 2001). 

We also plan to perform ablation testing to see how the different components of 
the proposed SVRCE method contribute to its overall performance. All testing will 
be performed on several test beds encompassing sentences derived from web forums, 
blogs, and stories. Features and techniques will be evaluated with respect to their 
percentage mean error and correlation coefficients in comparison with a human- 
annotated gold standard. Further details about the features, techniques, ablation 
testing, and our research hypotheses are presented below, while the test bed and 
evaluation metrics are discussed in greater detail in the ensuing evaluation section.

3.2.1 Affect Analysis Features

The n-gram feature set is comprised of word, character, and part-of-speech (POS) 
tag n-grams. For each n-gram category, we used up to trigrams only (i.e., unigrams, 
bigrams, and trigrams), as done in prior related research (Pang et al. 2002; Wiebe 
et al. 2004). Word n-grams, including unigrams (e.g., “LIKE”), bigrams (e.g., 
“I LIKE,” “LIKE YOU”), and trigrams (e.g., “I LIKE YOU”), as well as POS tag 
n-grams (e.g., “NP VB,” “JJ NP VB”) have been used in prior affect analysis 
research (Mishne 2005). We also include character n-grams (e.g., “li,” “ik,” “ike”), 
which have been useful in related sentiment classification studies (Abbasi et al. 
2008). In addition to standard word n-grams, we incorporate hapax legomena and 
dis legomena collocations (Wiebe et al. 2004). Such collocations replace once- 
(hapax legomena) and twice-occurring words (dis legomena) with “HAPAX” and 
“DIS” tags. Hence, the trigram “I hate Jim” would be replaced with “I hate HAPAX” 
provided “Jim” only occurs once in the corpus. The intuition behind such collocations
is to replace sparsely occurring words with tags that will allow the extracted
n-grams to be more generalizable, and hence, more useful (Wiebe et al. 2004). For
instance, in the above example, the fact that the writer hates is more important from 
an affect analysis perspective than the specific person the hate is directed toward. 
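The n-gram extraction and the hapax/dis legomena replacement described above can be sketched as follows. This is an illustrative Python sketch, not the implementation used in the study: sentences are assumed to be pre-tokenized and lowercased, POS tagging is omitted, and all function names are hypothetical.

```python
from collections import Counter


def word_ngrams(tokens, max_n=3):
    """Word unigrams, bigrams, and trigrams of a token list."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def char_ngrams(word, max_n=3):
    """Character n-grams (up to trigrams) of a single word."""
    return [word[i:i + n]
            for n in range(1, max_n + 1)
            for i in range(len(word) - n + 1)]


def tag_rare_words(sentences):
    """Replace once- and twice-occurring words with HAPAX and DIS tags."""
    counts = Counter(tok for sent in sentences for tok in sent)
    tags = {1: "HAPAX", 2: "DIS"}
    return [[tags.get(counts[tok], tok) for tok in sent] for sent in sentences]
```

Running `tag_rare_words` over a small corpus in which "jim" occurs once turns the sentence `["i", "hate", "jim"]` into `["i", "hate", "HAPAX"]`, and `word_ngrams` on the result then yields the generalized trigram "i hate HAPAX".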

The lexicons employed are comprised of automated lexicons derived using 
semantic orientation and WordNet models as previously done by Grefenstette et al. 
(2004a) and Mishne (2005). We selected seven paradigm words for each affect class 
for input into the semantic orientation algorithm, as described in Sect. 2.1. For the 
WordNet models, sets of up to 50 words were used as the seeds, following the 
guidelines described by Kim and Hovy (2004). 

Our feature set also consists of a manually crafted word-level lexicon. The lexicon 
is comprised of over 1,000 affect words for several emotive classes (e.g., happiness, 
sadness, anger, hate, violence, etc.). Each word is assigned an intensity and ambiguity 
score between 0 and 1. The intensities are assigned based on the word’s degree of 
severity or valence for its particular affect category (with 1 being highest). This 
approach is consistent with the intensity score assignment methods incorporated in 
previous studies that utilized manually crafted lexicons (Donath et al. 1999; Subasic 
and Huettner 2001; Grefenstette et al. 2004b; Chuang and Wu 2004). Each affect 
feature is also assigned an ambiguity score. The ambiguity score is the probability 
of an instance of the feature having semantic congruence with the affect class repre- 
sented by that feature. The ambiguity score for each feature is determined by taking 
a sample set of instances of the feature’s occurrence and coding each occurrence as 
to whether the term usage is relevant to its affect. A maximum of 20 samples was 
used per term. Using more instances would be exhaustive, and we observed that the 
size used was sufficient to accurately capture the probability of an affect being 
relevant. The ambiguity score for each word can be computed as the number of 
correctly appearing instances divided by the total number of instances sampled for 
that word. Hence, an ambiguity value of one suggests that the term always appears 
in the appropriate affective connotation. The intensity and ambiguity assignment 
was done by two independent coders. Each coder initially assigned values without
consulting the other. The coders then consulted one another in order to resolve
tagging differences. The inter-coder reliability tests revealed a kappa statistic of
0.78 prior to coder discussions and 0.89 after discrepancy resolution. For situations
where the disparity could not be resolved even after discussions, the two coders’
values were averaged. Table 11.2 shows examples from the violence affect lexicon.
The weight for each term is the product of its intensity and ambiguity value. This is
the value assigned to each occurrence of the term in the text being analyzed. For
example, “lynch” was considered more severe by the coders than “hang.” Although
the two terms represent similar actions, the more severe motivation behind “lynch”
as compared to “hang” resulted in a higher intensity score. Furthermore, the word
“lynch” was also less ambiguous, conveying only a single violent meaning in the
samples analyzed by the coders during the disambiguation procedure.

Table 11.2 Manual lexicon examples for the violence affect

Term     Intensity   Ambiguity   Weight
Hit      0.210       0.800       0.168
Beat     0.400       0.667       0.267
Stab     0.575       1.000       0.575
Hang     0.800       0.650       0.520
Kill     0.850       0.950       0.808
Lynch    1.000       1.000       1.000
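The ambiguity and weight calculations can be checked with a short Python sketch. The intensity and ambiguity values are those listed in Table 11.2; the function names are hypothetical, and the 13-of-20 sample split used in the usage note is only one count that is consistent with an ambiguity of 0.650, not a figure reported in the study.

```python
def ambiguity_score(relevant_flags):
    """Share of sampled occurrences in which the term carried its affect."""
    return sum(relevant_flags) / len(relevant_flags)


def term_weight(intensity, ambiguity):
    """Weight assigned to each occurrence of a lexicon term."""
    return intensity * ambiguity


# (intensity, ambiguity) pairs from Table 11.2
VIOLENCE_LEXICON = {
    "hit": (0.210, 0.800),
    "beat": (0.400, 0.667),
    "stab": (0.575, 1.000),
    "hang": (0.800, 0.650),
    "kill": (0.850, 0.950),
    "lynch": (1.000, 1.000),
}

WEIGHTS = {term: term_weight(i, a) for term, (i, a) in VIOLENCE_LEXICON.items()}
```

For example, `ambiguity_score([True] * 13 + [False] * 7)` gives 0.65, and the computed weights agree with the Weight column of Table 11.2 to within rounding.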



3.2.2 Affect Analysis Techniques 

Ensemble classifiers use multiple classifiers, with each built using different techniques, 
training instances, or feature subsets (Dietterich 2000). Particularly, the feature 
subset classifier approach has been shown to be effective for analysis of style and 
patterns. Stamatatos and Widmer (2002) used an SVM ensemble for music performer 
recognition. They used multiple SVMs, each trained using different feature subsets. 
Similarly, Cherkauer (1996) used a neural network ensemble for imagery analysis. 
Their ensemble consisted of 32 neural networks trained on eight different feature 
subsets. The intuition behind using a feature ensemble is that it allows each classifier 
to act as an “expert” on its particular subset of features (Cherkauer 1996; Stamatatos 
and Widmer 2002), thereby improving performance over simply using a single 
classifier. We propose the use of a support vector regression ensemble that incorpo- 
rates the relationship between various affect classes in order to enhance affect clas- 
sification performance. Our ensemble includes multiple SVR models, each trained 
using a subset of features most effective for differentiating emotive intensities for a 
single affect class. We use the information gain (IG) heuristic to select the features 
for each SVR classifier. Since affect intensities are continuous, discretization must be 
performed before IG can be applied. We use 5 and 10 class bins (e.g., an intensity 
value of 0.15 would be placed into class 1 of 5 and 2 of 10 using 5 and 10 class bins).
All features with an average information gain greater than a threshold t are selected,
as done in prior research (Yang and Pedersen 1997).

The support vector regression correlation ensemble (SVRCE) adjusts the affect
intensity prediction for a particular sentence based on the predicted intensities of
other affects. The amount of adjustment is proportional to the level of correlation
between affect classes (i.e., the affect class being predicted and the ones being used
to make the adjustment) as derived from the training data. The SVRCE formulation
is shown in Fig. 11.2.

The rationale behind SVRCE is that in certain situations, a particular sentence
may get misclassified by a trained model due to a lack of prior exposure to the
affective cues inherent in its text. In such circumstances, leveraging the relationship
between affect classes may help alleviate the magnitude of such erroneous
classifications.

We intend to compare the proposed SVRCE method against machine learning
and scoring-based methods used in prior affect analysis research. These include the
Pace regression technique proposed by Witten and Frank (2005) which was used to
analyze affect intensities in weblogs (Mishne and Rijke 2006) as well as the semantic
orientation, WordNet model, and manual lexicon scoring approaches. In addition to
comparing the proposed SVRCE against other affect analysis techniques, we also
intend to perform ablation testing to better understand the impact different components
of our proposed method have on classification performance. Since SVRCE
uses correlation information and feature-subset-based ensembles, we plan to compare
it against an SVR ensemble that does not use correlation information as well as
an SVR trained using a single feature set for all affect classes. The hypotheses
associated with our research framework are presented below.

Let M = {1, 2, ..., m} denote the set of training instances.

The SVR correlation ensemble intensity score for instance i and affect class c can be computed as follows:

\[
SVRCE_c(i) = SVR_c(i) + \sum_{a=1,\, a \neq c}^{n} K \left( Corr(c,a)^2 \left( SVR_a(i) - SVR_c(i) \right) \right)
\]

Where:

SVR_c(i) is the prediction for instance i for affect class c using an SVR model trained on M;
the feature subset for SVR_c is selected using the IG heuristic;
c and a are part of the set of n affect classes being investigated, and c ≠ a;

K = 1 if Corr(c,a) > 0, K = -1 otherwise;

Corr(c,a) is the correlation coefficient for affect classes c and a across the m training instances as follows:

\[
Corr(c,a) = \frac{\sum_{x \in M} (c_x - \bar{c})(a_x - \bar{a})}{\sqrt{\sum_{x \in M} (c_x - \bar{c})^2 \, \sum_{x \in M} (a_x - \bar{a})^2}}
\]

For:

c_x and a_x are the actual intensity values for affects c and a assigned to x ∈ M;

\bar{c} and \bar{a} are the average intensity values for affects c and a across the m training instances.

Fig. 11.2 SVR correlation ensemble for assigning affect intensities
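The correlation-based adjustment can be illustrated with a minimal Python sketch. It assumes the per-class SVR predictions for an instance are already available as plain numbers (no SVR training is shown), computes Corr(c,a) as a Pearson correlation over the training intensities, and reads the adjustment as K times the squared correlation times the prediction difference, summed over the other affect classes with no further normalization. That reading, and all names below, are assumptions made for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of intensities."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat constant series as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def svrce_score(target, predictions, train_intensities):
    """Adjust the base prediction for `target` using the other affect classes.

    predictions[a]       -- base SVR prediction for this instance and affect a
    train_intensities[a] -- gold intensities for affect a over the training set
    """
    base = predictions[target]
    adjustment = 0.0
    for other, other_pred in predictions.items():
        if other == target:
            continue
        corr = pearson(train_intensities[target], train_intensities[other])
        k = 1 if corr > 0 else -1
        adjustment += k * corr ** 2 * (other_pred - base)
    return base + adjustment
```

When two affects are perfectly correlated in the training data, this sketch pulls the target prediction all the way to the other class's prediction; weaker correlations produce proportionally smaller shifts.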

3.3 Research Hypotheses

H1: Features

The use of learned generic n-gram features will outperform manually and automatically
crafted affect lexicons. Additionally, using an extended feature set encompassing
all features will outperform individual feature sets.

• H1a: N-grams > manual lexicon, semantic orientation, WordNet models

• H1b: All features > n-grams, manual lexicons, semantic orientation, WordNet
models

H2: Techniques 

The proposed SVRCE method will outperform comparison techniques used in prior 
studies for affect analysis. 

• H2: SVRCE > Pace regression, semantic orientation scores, WordNet model 
scores, manual lexicon scores 

H3: Ablation Testing 

The SVRCE method will outperform an SVR ensemble not using correlation infor- 
mation as well as SVR run using a single feature set. Furthermore, the SVR ensem- 
ble will also significantly outperform SVR run using a single feature set. 

• H3a: SVRCE > SVR ensemble, SVR 

• H3b: SVR ensemble > SVR



4 Evaluation 

We conducted experiments to evaluate various affective feature representations 
along with different affect analysis techniques, including the proposed support vector 
regression correlation ensemble (SVRCE). The experiments were conducted on four 
test beds comprised of sentences taken from web forums, blogs, and short stories. 
This section encompasses a description of the test beds, experimental design, exper- 
imental results, and outcomes of the hypotheses testing. 



4.1 Test Bed 

Analyzing affect intensities across application domains is important in order to get 
a better sense of the effectiveness and generalizability of different features and 
techniques. As a result, our test bed consisted of sentences taken from two corpora 
(shown in Table 11.3). The first test bed was a set of supremacist web forums discussing
issues relating to Nazi and socialist ideologies. The second was comprised
of 1,000 sentences taken from a couple of Arabic language Middle Eastern forums 
discussing issues relating to the war in Iraq. Analysis of such forums is important to 
better understand cyber activism, social movements, and people’s political sentiments.

Table 11.3 Test bed description

Test bed name                     Source URL(s)                         No. of sentences   Affect classes tagged           Inter-coder reliability
Supremacist Web forums (SF)       www.stormfront.org, www.nazi.org      1,000              Violence, anger, hate, racism   0.89
Middle Eastern Web forums (MEF)   www.montada.com, www.alfirdaws.com    1,000              Violence, anger, hate, racism   0.79*

* Kappa value from initial tagging



Two independent coders tagged the sentences for intensities across the four affect 
classes used for each test bed (shown in Table 11.3). Each sentence was tagged with
an intensity score between 0 and 1 (with 1 being most intense) for each of the affects.
The tagging followed the same format as the one used for the manual lexicon creation.
Each coder initially assigned values without consulting the other. The coders
then consulted one another in order to resolve tagging differences. For situations 
where the disparity could not be resolved even after discussion, the two coders’ 
values were averaged. The inter-coder reliability kappa values shown in Table 11.3 
are from after discrepancy resolution (prior to averaging). For the Middle Eastern 
forums, the coders were unable to meet to resolve coding differences. For this test 
bed, the kappa value shown is for the two coders’ initial tagging. 



4.2 Experimental Design 

Based on our research framework and hypotheses presented in Sect. 3, three experi- 
ments were conducted. The first was intended to compare the performance of 
learned n-grams against manually and automatically crafted lexicons. We also 
investigated the effectiveness of an extended feature set comprised of n-grams and 
lexicons versus individual feature groups. The second experiment compared differ- 
ent affect analysis techniques, including the proposed SVRCE, Pace regression, and 
scoring methods. The final experiment pertained to ablation analysis of the major 
components of SVRCE, including the use of correlation information and an ensem- 
ble approach to affect classification. In order to allow statistical testing of results, 
we ran 50 bootstrap instances for each condition across all three experiments. In 
each bootstrap run, 95% of the sentences were randomly selected for training while 
the remaining 5% were used for testing (Argamon et al. 2007). The average results 
across the 50 bootstrap runs were reported for each experimental condition. 
Performance was evaluated using standard metrics for affect analysis, which include 
the mean percentage error and the correlation coefficient (Mishne and Rijke 2006): 

\[
\text{Mean \% Error} = \frac{100}{n} \sum |x - y|
\qquad
Corr(X,Y) = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \, \sum (y - \bar{y})^2}}
\]

where x and y are the actual and predicted intensity values for one of the n testing
instances denoted by the vectors X and Y, and \bar{x} and \bar{y} are the
corresponding mean values.

Table 11.4 Overall results for various feature sets

                     Mean % error          Correlation coefficient
Features/test bed    SF        MEF         SF        MEF
N-grams              4.6360    3.8066      0.6627    0.7455
SO                   5.0725    4.4742      0.4558    0.5308
WNet                 4.9646    4.5507      0.4952    0.5122
ML                   4.9767    4.6147      0.5388    0.4121
All                  4.8176    4.3522      0.6238    0.6036
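The two evaluation metrics and the 95%/5% split can be sketched in Python as follows. The sketch assumes the correlation coefficient is the standard Pearson coefficient and that each run draws its training set by random sampling without replacement; both are assumptions, as are the function names.

```python
import math
import random


def mean_percentage_error(actual, predicted):
    """(100 / n) * sum of absolute differences between actual and predicted."""
    return 100.0 / len(actual) * sum(abs(x - y) for x, y in zip(actual, predicted))


def correlation(actual, predicted):
    """Pearson correlation between actual and predicted intensities."""
    n = len(actual)
    mean_x, mean_y = sum(actual) / n, sum(predicted) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(actual, predicted))
    var_x = sum((x - mean_x) ** 2 for x in actual)
    var_y = sum((y - mean_y) ** 2 for y in predicted)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: undefined correlation reported as zero
    return cov / math.sqrt(var_x * var_y)


def random_split(n_items, train_fraction=0.95, rng=None):
    """Randomly split item indices into training and testing sets."""
    rng = rng or random.Random()
    indices = list(range(n_items))
    rng.shuffle(indices)
    cut = round(n_items * train_fraction)
    return indices[:cut], indices[cut:]
```

Calling `random_split(1000)` fifty times and averaging the two metrics over the held-out sentences mirrors the procedure described above, with 950 training and 50 testing sentences per run.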



4.3 Experiment 1: Comparison of Feature Sets 

In this experiment, we compared generic n-grams with semantic orientation (SO), 
WordNet model (WNet), and the manual lexicon (ML). We also constructed an 
extended feature set comprised of n-grams, SO, WNet, and ML (labeled “All”). All 
feature sets were evaluated using the support vector regression correlation ensemble 
(SVRCE). SVRCE was run using a linear kernel. N-grams were selected using the 
information gain heuristic applied at the affect level, as outlined in Sect. 3.2.2. The 
information gain was applied to the training data during each of the 50 bootstrap 
instances; these features were then used to train the SVRCE classifiers used on the 
testing data. This resulted in 16 n-gram feature subsets (one for each affect class 
across the four test beds) and a corresponding SVRCE model for each feature subset. 
SO and WNet were run using the formulas described in Sects. 2.1 and 3.2.2. For SO,
WNet, and ML, the word-level scores were computed for each sentence, resulting 
in a vector of scores for each sentence. Since different paradigm/seed words were 
used for each affect across all four test beds, the lexicon methods also generated 16 
sets of sentence vectors each. Consistent with Mishne (2005), these vectors were 
treated as features input into the SVRCE. For the “All” feature set, the lexicon sentence 
vectors were merged with the n-gram frequency vectors. 
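The information gain selection step, including the discretization of continuous intensities into 5 and 10 class bins, can be sketched as follows. The binning rule (ceiling of intensity times the number of bins), the treatment of each n-gram as a binary present/absent feature, and the averaging of information gain over the two bin settings are assumptions about details the text leaves open; all names are hypothetical.

```python
import math
from collections import Counter


def to_bin(intensity, n_bins):
    """Map an intensity in [0, 1] to a class label in 1..n_bins."""
    return min(n_bins, max(1, math.ceil(intensity * n_bins)))


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(feature_present, labels):
    """Information gain of a binary feature with respect to class labels."""
    total = len(labels)
    gain = entropy(labels)
    for value in (True, False):
        subset = [lab for flag, lab in zip(feature_present, labels) if flag == value]
        if subset:
            gain -= len(subset) / total * entropy(subset)
    return gain


def average_ig(feature_present, intensities, bin_counts=(5, 10)):
    """Average information gain over the 5- and 10-bin discretizations."""
    gains = [information_gain(feature_present, [to_bin(v, b) for v in intensities])
             for b in bin_counts]
    return sum(gains) / len(gains)
```

An intensity of 0.15 falls into class 1 of 5 and class 2 of 10 under this rule, matching the example in Sect. 3.2.2; a feature would then be kept when `average_ig` exceeds the chosen threshold t.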

Table 11.4 shows the macrolevel experimental results for the mean percentage
error and correlation coefficients across the five feature sets applied to the two test 
beds. The values shown were averaged across the four affect classes used within 
each test bed. The test bed labels correspond to the abbreviations presented in 
Table 11.3 under the column “Test bed name.” The n-gram features appeared to 
have the best performance, with the lowest mean percentage error and highest cor- 
relation coefficient for all test beds. The automated (i.e., SO and WNet) and manual 
lexicons all had fairly similar performance, with mean errors typically in the 5-7% 
range and correlation coefficients between 0.2 and 0.5. As anticipated, the use of all 
features performed well, outperforming the use of individual lexicons. Surprisingly,
however, using all features (i.e., n-grams in conjunction with lexicons) did not
outperform the use of n-grams alone. N-grams outperformed the extended feature set
by as much as 0.5% and 0.14 on mean error and correlation coefficient, respectively.
This suggests that the learned n-grams were able to effectively represent affective
patterns in the text. Adding lexicon features introduced redundancy and, in some
instances, noise. Further elaboration regarding the performance of n-grams in
comparison with other feature sets is provided in the hypotheses testing section.

Table 11.5 Results for experiment 2 (comparison of techniques)

                      Mean % error          Correlation coefficient
Techniques/test bed   SF        MEF         SF        MEF
SO                    8.6590    14.8759     0.4673    0.2530
WNet                  5.9899    8.6639      0.5837    0.5224
ML                    6.7270    8.3860      0.5500    0.5251
SVRCE                 4.6360    3.8066      0.6627    0.7455
Pace                  6.3038    5.8473      0.5692    0.6124



4.4 Experiment 2: Comparison of Techniques 

The SVRCE method was compared against scoring and machine learning methods 
used in prior studies. The comparison techniques included Pace regression (Mishne 
and Rijke 2006), WordNet (WNet) scores (Kim and Hovy 2004; Mishne 2005), the 
pointwise mutual information scores from the semantic orientation (SO) approach, 
and the scores from our manual lexicon (ML). For SO, WNet, and ML, the average 
word-level intensities were used as the sentence-level scores as done in prior affect 
analysis research (Subasic and Huettner 2001; Grefenstette et al. 2004a; Read 2004; 
Cho and Lee 2006). SVRCE and Pace regression were both run using the n-gram 
features. N-grams were used since they had the best performance in experiment 1. 
Both techniques (i.e., SVRCE and Pace) were run using identical features, with 
each using 16 feature subsets selected using the information gain heuristic as 
described in experiment 1. Any scores outside the 0-1 range were adjusted to fit the
possible range of intensities (this was done in order to avoid inflated errors stem- 
ming from values well outside the feasible range). 

Table 11.5 shows the macrolevel experimental results for the mean percentage 
error and correlation coefficients across the five techniques. The SVRCE method 
had the best performance, with the lowest mean percentage error and highest 
correlation coefficient for all four test beds. Pace regression, WordNet (WNet) models, 
and the manual lexicon (ML) scoring methods were all in the middle, while the 
semantic orientation scoring method had the worst performance. The results are 
consistent with prior research that has also observed large differences between the 
word-level scores assigned using WNet and SO (Mishne 2005). The machine learn- 
ing methods (SVRCE and Pace) both fared well with respect to their correlation 
coefficients. Pace also performed well on the supremacist and Middle Eastern 
forums in terms of mean percentage error, but not on the blogs test bed (LJ).

Table 11.6 Results for experiment 3 (ablation testing)

                      Mean % error          Correlation coefficient
Techniques/test bed   SF        MEF         SF        MEF
SVRCE                 4.6360    3.8066      0.6627    0.7455
SVRE                  5.0776    4.0667      0.5990    0.7231
SVR                   5.7676    5.0460      0.5631    0.5757

4.5 Experiment 3: Ablation Testing 

Ablation testing was performed to evaluate the effectiveness of the different SVRCE 
components. The SVRCE was compared against a support vector regression ensem- 
ble (SVRE) that does not utilize correlation information, as well as a support vector 
regression classifier using only a single feature set (SVR). The SVR was trained 
using a single feature set (for each test bed) selected by using all n-grams occurring 
at least five times in the corpus (Jiang et al. 2004). The SVRE and SVRCE were 
both run using information gain on the training data to select the 16 feature subsets 
most representative of each affect class. The experiment was intended to evaluate 
the two core components of SVRCE: (1) its use of feature ensembles to better rep- 
resent affective content; (2) the use of correlation information for enhanced affect 
classification. Table 11.6 shows the macrolevel results for the mean percentage error 
and correlation coefficients for SVRCE, SVRE, and SVR. 

The SVRCE method had the best performance, with the lowest mean percentage 
error and highest correlation coefficient for all test beds. SVRCE marginally
outperformed SVRE, while both techniques outperformed SVR. The results suggest that
feature ensembles and correlation information are both useful for classifying
affective intensities.



4.6 Hypotheses Results 

We conducted pairwise t tests on the 50 bootstrap runs for all three experiments. 
Given the large number of comparison conditions, a Bonferroni correction was 
performed to avoid spurious positive results. All P values less than 0.0005 were 
considered significant at alpha = 0.01. 
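
The threshold above can be reproduced with simple arithmetic. The sketch below assumes 20 comparison conditions, a figure inferred from 0.01 / 0.0005 rather than stated in the text, and also shows how a paired t statistic is formed from per-run differences. The function names are illustrative, and converting the t statistic to a P value (which needs a t-distribution routine) is deliberately left out.

```python
import math
import statistics


def bonferroni_threshold(alpha, num_comparisons):
    """Per-comparison significance threshold under a Bonferroni correction."""
    return alpha / num_comparisons


def paired_t_statistic(scores_a, scores_b):
    """t statistic of a paired t test over matched runs (needs at least two runs)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))


def is_significant(p_value, alpha=0.01, num_comparisons=20):
    """True if p_value falls below the corrected threshold (20 comparisons is assumed)."""
    return p_value < bonferroni_threshold(alpha, num_comparisons)
```

Under these assumptions, a P value of 0.0013 is not significant, because it exceeds the corrected threshold of 0.0005 even though it is below the uncorrected alpha of 0.01.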



4.6.1 H1: Feature Comparison

Table 11.7 shows the results for pairwise t tests conducted to compare the effective- 
ness of the extended and n-gram feature sets with other feature categories. 

N-grams and the extended feature set both significantly outperformed the lexicon- 
based representations on all test beds with respect to mean error and correlation (all 
P values<0.0001). Surprisingly, the extended feature set did not outperform n-grams.

Table 11.7 P values for pairwise t tests (n = 50) on feature comparisons

Comparison features    Mean % error             Correlation coefficient
                       SF          MEF          SF          MEF
All vs. n-gram         <0.0001*    <0.0001*     <0.0001*    <0.0001*
All vs. SO             <0.0001     <0.0001      <0.0001     <0.0001
All vs. WNet           <0.0001     <0.0001      <0.0001     <0.0001
All vs. ML             <0.0001     <0.0001      <0.0001     <0.0001
N-gram vs. SO          <0.0001     <0.0001      <0.0001     <0.0001
N-gram vs. WNet        <0.0001     <0.0001      <0.0001     <0.0001
N-gram vs. ML          <0.0001     <0.0001      <0.0001     <0.0001

* Result contradicts hypothesis

Table 11.8 Sample learned n-grams and lexicon items for hate affect

Learned n-grams
Category                               N-gram
Character n-grams                      uck, ck, fuc
Word n-grams                           terribly, suck, the stupid, the s**t, the f**k
Hapax and dis legomena collocations    HAPAX so awful
POS tag n-grams                        PERSON_SG, WEEKDAY. NNP, TIME.SG

Lexicon items: awful, stupid, terrible, sicken, s**t, f**k

Table 11.9 P values for pairwise t tests (n = 50) on technique comparisons

Comparison features    Mean % error            Correlation coefficient
                       SF         MEF          SF         MEF
SVRCE vs. SO           <0.0001    <0.0001      <0.0001    <0.0001
SVRCE vs. WNet         <0.0001    <0.0001      <0.0001    <0.0001
SVRCE vs. ML           <0.0001    <0.0001      <0.0001    <0.0001
SVRCE vs. Pace         <0.0001    <0.0001      <0.0001    <0.0001


In contrast, the n-gram feature set significantly outperformed the use of all features 
(n-grams plus the three lexicons), with all P values significant at alpha=0.01. 

Table 11.8 provides examples of learned n-grams taken from the LiveJournal test
bed for the hate affect. It also shows some related hateful items from the manual
lexicon. The n-grams
were able to learn many of the concepts conveyed in the lexicon. Furthermore, 
the n-grams were able to provide better context for some features and also learn 
deeper patterns in several instances. For example, the hate in LiveJournal blogs is 
often directed toward specific people and frequently involves places and times. This 
pattern is captured by the POS tag n-grams. In contrast, word lexicons cannot accu- 
rately represent such complex patterns. The example illustrates how the n-gram 
features learned were more effective than the lexicons employed in this study.

Table 11.10 P values for pairwise t tests (n = 50) on ablation testing

Comparison features    Mean % error            Correlation coefficient
                       SF         MEF          SF         MEF
SVRCE vs. SVRE         <0.0001    <0.0001      <0.0001    0.0013
SVRCE vs. SVR          <0.0001    <0.0001      <0.0001    <0.0001
SVRE vs. SVR           <0.0001    <0.0001      <0.0001    <0.0001


4.6.2 H2: Technique Comparison 

As shown in Table 11.9, the SVRCE method significantly outperformed all four
comparison techniques on mean percentage error and correlation coefficient across 
all four test beds. All P values were less than 0.0005 and therefore significant at 
alpha = 0.01. The results indicate that the SVRCE method’s use of ensembles of 
learned n-gram features combined with affect correlation information allows the 
classifier to assign affect intensities with greater effectiveness than comparison 
approaches used in prior research. 



4.6.3 H3: Ablation Tests 

Table 11.10 shows the P values for pairwise t tests conducted to assess the contribu- 
tion of the major components of the SVRCE method. The results of SVRCE versus 
SVRE revealed that the use of correlation information significantly enhanced per- 
formance on most test beds (significant for three out of four test beds on mean error 
and correlation). The results were not significant for mean error on the LiveJournal 
blog test bed (P value = 0.3452) as well as for correlation on the Middle Eastern 
forum dataset (P value = 0.0013). Both SVRCE and SVRE also significantly outper- 
formed SVR, indicating that the use of feature ensembles is effective for classifying 
affect intensities (all P values less than 0.0001, significant at alpha = 0.01). 



5 Case Study: Al-Firdaws vs. Montada 

Many prior studies have used brief case studies to illustrate the utility of their proposed 
affect analysis methods (Subasic and Huettner 2001; Mishne and Rijke 2006). In order
to demonstrate the usefulness of the SVRCE method coupled with a rich set of learned 
n-grams, we analyzed the affective intensities in two popular Middle Eastern web 
forums: www.alfirdaws.org/vb and www.montada.com. Analysis of affects in such 
forums is important for sociopolitical reasons and to better our understanding of social 
phenomena in online communities. Al-Firdaws is considered a more extreme forum 
by domain experts, with considerable content dedicated to supporting the Iraqi insurgency
and al-Qaeda. In contrast, Montada is a general discussion forum with content and 
discussion pertaining to various social matters. We hypothesized that our SVRCE

Table 11.11 Summary statistics for the two web forums collected

Forum         No. of authors    No. of threads    No. of messages    No. of sentences    Duration
Al-Firdaws    2,189             14,526            39,775             244,917             January 2005-July 2007
Montada       31,692            114,965           869,264            2,052,511           September 2000-July 2007



Fig. 11.3 Posting frequency for the two web forums (recoverable panel title: Al-Firdaws Posts By Month)



method would be able to effectively depict the likely intensity differences between 
these two web forums for appropriate affect classes. 

We used spidering programs to collect the content in both web forums. Table 11.11 
shows summary statistics for the content collected from the two forums. The 
Montada forum was considerably larger, with over 31,000 authors and a large num- 
ber of threads and postings, partially because it had been around for approximately 
7 years. Al-Firdaws was a relatively newer forum, beginning in 2005. Owing to the
nature of its content and its shorter period of existence, this forum had fewer authors
and postings.

Figure 11.3 shows the number of posts for each month the forums have been active. 
Montada was very active in 2002 and 2005, with over 20,000 posts in some months,

Table 11.12 Affect intensities per posting across two web forums

Intensity              Forum         Violence    Anger    Hate     Racism
Average per message    Al-Firdaws    0.084       0.018    0.037    0.032
                       Montada       0.027       0.012    0.010    0.014
Total per message      Al-Firdaws    0.523       0.127    0.178    0.191
                       Montada       0.246       0.105    0.092    0.134


Fig. 11.4 Temporal view of intensities in two web forums (recoverable panel titles: Montada-Violence, Montada-Hate)

yet appears to be in a down phase in 2007 (similar to 2004). Al-Firdaws consistently 
had between 2,500 and 3,000 posts per month since the second half of 2006. 

The SVRCE classifier was employed in conjunction with the n-gram feature set 
to analyze affect intensities in the two web forums. Analysis was performed on 
violence, hate, racism, and anger affects. We computed the average posting level 
intensities (averaged across all sentences in a posting) as well as the total intensity 
per post (the summation of sentence intensities in each posting). The analysis was 
performed on all postings in each forum (approximately 900,000 postings and 2.3 
million sentences). As shown in Table 11.12, the Al-Firdaws forum had consider- 
ably higher affect intensities for all four affect classes, usually 2-3 times greater 
than Montada. 

Figure 11.4 depicts the average message violence and hate intensities over time
for all postings in each of the two web forums. The x-axis indicates time, while the
y-axis shows the intensities (on a scale of 0-1). Each point represents a single message;
areas with greater message concentrations are darker. The blank periods in the
diagrams correspond to periods of posting inactivity in forums (see Fig. 11.3 for
correspondence). Based on the diagram, we can see that Al-Firdaws has considerably
higher violence and also greater hate intensity across time. Al-Firdaws also
appears to have increasing violence intensity in 2007 (based on the concentration of
postings), possibly attributable to the increased activity in this forum. In contrast,
violence and hate intensities are consistently low in Montada. The results generated 
using SVRCE and n-gram features are consistent with existing knowledge regard- 
ing these two forums. The case study illustrates how the proposed features and 
techniques can be successfully applied toward affect analysis of computer-mediated 
communication text. 



6 Conclusions 

In this chapter, we evaluated various features and techniques for affect analysis of 
online texts. In addition, the support vector regression correlation ensemble 
(SVRCE) was proposed. This method leverages an ensemble of SVR classifiers 
with each constructed for a separate affect class. The ensemble of predictions 
combined with the correlation between affect classes is leveraged for enhanced 
affect classification performance. Experimental results on test beds derived from 
online forums, blogs, and stories revealed that the proposed method outperformed 
existing affect analysis techniques. The results also suggested that learned n-grams 
can improve affect classification performance in comparison with lexicon-based 
representations. However, combining n-gram and lexicon features did not improve 
performance due to increased amounts of noise and redundancy in the extended 
feature set. A case study was also performed to illustrate how the proposed features 
and techniques can be applied to large cyber communities in order to reveal affective 
tendencies inherent in these communities’ discourse. To the best of our knowledge, 
the experiments conducted in this study are the first to evaluate features and tech- 
niques for affect analysis. Furthermore, we are also unaware of prior research 
applied to such a vast array of domains and test beds. 

We believe this chapter provides an important stepping stone for future work 
intended to further enhance the feature representations and techniques used for 
classifying affects. Based on this work, we have identified several future research 
directions. We intend to apply the techniques across a larger set of affect classes 
(e.g., 10-12 affects per test bed). We are also interested in exploring additional 
feature representations, such as the use of richer learned n-grams (e.g., semantic 
collocations, variable n-gram patterns, etc.). We also plan to evaluate the
effectiveness of real-world knowledge bases such as those employed by Liu et al. (2003).


Chapter 12

CyberGate Visualization 



1 Introduction 

Computer-mediated communication (CMC) has seen tremendous growth due to the 
fast propagation of the Internet. Text-based modes of CMC include e-mail, listservs, 
forums, chat, and the World Wide Web (Herring 2002). These modes of CMC 
continue to have a profound impact on organizations due to their quick and ubiqui- 
tous nature. Electronic communication methods have redefined the fabric of organi- 
zational culture and interaction. With the persistent evolution of communication 
processes (Welck 1987) and constant advancements in technology, such metamor- 
phoses are likely to continue. One such trend has been the increased use of online 
communities to support business operations (Wenger and Snyder 2000). Business 
online communities provide invaluable mechanisms for various forms of interaction 
including knowledge dissemination, transfer of goods and services, and product/ 
service reviews (Cothrel 2000). Electronic communities (Wenger and Snyder 2000) 
and networks (Wasko and Faraj 2005) of practice allow companies to tap into the 
wealth of information and expertise available across corporate boundaries. Internet 
marketplaces enable the efficient transfer of goods and services in cyberspace while 
consumer rating forums have also emerged, providing product reviews that can be 
equally useful to potential customers and marketing departments (Turney and 
Littman 2003; Pang et al. 2002). 

In spite of the numerous benefits of electronic communication, it is not without 
its pitfalls. Two characteristics of computer-mediated communication have proven 
to be particularly problematic: online anonymity and the enormity of data present 
in cyber communities. These vices undermine the numerous benefits associated 
with CMC and online business communities (Davenport 2002). 

The anonymous nature of the Internet has resulted in several trust-related issues 
including online deception (Nissenbaum 1996; Friedman et al. 2000). In addition to 
deceit, cyberspace is crawling with agitators (also referred to as “trolls”) that attempt 
to peeve community members and disrupt online discourse (Donath 1999). 



H. Chen, Dark Web: Exploring and Data Mining the Dark Side of the Web,
Integrated Series in Information Systems 30, DOI 10.1007/978-1-4614-1557-2_12,
© Springer Science+Business Media, LLC 2012






Newsgroups and knowledge exchange communities also suffer from lurkers that 
free ride off of others (Smith 2002; Wasko and Faraj 2005). Collectively, these 
concerns can cast serious doubts onto the quality of information exchanged in such 
online communities (Davenport 2002; Viegas and Smith 2004; Wasko and Faraj 
2005). Cyber communities also contain large volumes of information, including 
various communication modes, topics, threads, messages, and authors. CMC environ- 
ments feature very large-scale conversations involving thousands of people and 
messages (Sack 2000; Herring 2002). The enormous information quantities make 
such places noisy and difficult to navigate (Viegas and Smith 2004). 

Hence, there is a need for techniques to evaluate, summarize, and present computer- 
mediated communication content. Many believe that the solution is to develop systems 
that support navigation and knowledge discovery (Wellman 2001). Such CMC 
information systems can enhance informational transparency which benefits 
community participants and observers (Sack 2000; Kelly et al. 2002). Erickson and 
Kellogg (2000) argued that tools supporting social translucence in cyberspace would 
improve participant accountability. Smith (1999) suggested that methods for providing 
social accounting data could be mutually beneficial to online community members 
and researchers. Consequently, numerous CMC information systems have been 
developed in order to address these needs (Xiong and Donath 1999; Fiore and Smith 
2002; Viegas et al. 2004; Viegas and Smith 2004). These techniques generally visu- 
alize data provided in the message headers such as interaction- (send/reply struc- 
ture) and activity-based (posting patterns) information. Little support is provided 
for analysis of textual information contained in the message bodies. In the instances 
where text analysis is provided, simple feature representations such as those used in 
information retrieval systems (Sack 2000; Whitelaw and Patrick 2004) are utilized, 
e.g., bag-of-words (Mladenic 1999). 

In addition to topical information, online discourse is also rich in social cues 
including emotion, opinion, style, and genre (Yates and Orlikowski 1992, 2002; 
Henri 1992; Hara et al. 2000). There is a need for improved CMC system content 
analysis capabilities (Paccagnella 1997) based on a richer textual representation. This 
requires the incorporation of a complex set of features, techniques, and visual repre- 
sentations to facilitate enhanced text analysis. Unfortunately, text analysis features 
and visualization techniques are not well defined. Consequently, there is a need for a 
set of design guidelines for CMC systems supporting text analysis (Sack 2000). 

This chapter proposes a design framework for the creation of CMC systems that 
can provide improved text analysis capabilities by incorporating a richer set of textual 
information types. Our framework addresses several important issues from the text 
mining literature en route to the development of a set of guidelines geared toward 
CMC text analysis. We then develop the CyberGate system based on our design 
framework. CyberGate includes the Writeprint and Ink Blot techniques that can be 
used for analysis and categorization of CMC text. 

The remainder of this chapter is organized as follows. Section 2 provides a review 
of CMC systems developed to support content analysis. This section outlines the 
need for systems that support CMC textual analysis. Section 3 reviews some of
the relevant text mining issues related to CMC text analysis. We then present a
design framework for CMC text analysis in Sect. 4 that takes into consideration the
text mining concerns outlined in the previous section. Section 5 uses the framework 
to develop CyberGate: a CMC text analysis system that features the Writeprint and 
Ink Blot techniques. In Sect. 6, we present two sample cases of how CyberGate can 
be used to help with Dark Web forum content and author attribution. The case 
studies involve two major extremist forums: Clearguidance.com and Ummah.com. 
We conclude this chapter in Sect. 7. 



2 Background 
2.1 CMC Content 

Several dimensions have been proposed for CMC content analysis (Henri 1992; 
Hara et al. 2000). These include participation/activity, interaction, social cues, top- 
ics, emotions, roles, linguistic variation, question types, response complexity, etc. 
(Henri 1992; Rourke et al. 2001). The information utilized for CMC content analy- 
sis can be broadly categorized as either structural or textual, as shown in Fig. 12.1. 
Structural features of CMC content include all attributes based on communication 
topology. These features are extracted from message headers, without any use of the 
message body (Sack 2000). Structural features support activity and interaction anal- 
ysis. Posting activity-related features include number of posts, number of initial 
messages, number of replies, number of responses to a particular author’s posts, etc. 
(Fiore et al. 2002). These features can be used to represent an author’s social account- 
ing metrics (Smith 2002). Activity/participation-based attributes and analysis can 
provide insight into different roles such as debaters, experts, etc. (Zhu and Chen 2001; 




Fig. 12.1 CMC content analysis model

Viegas and Smith 2004). While activity analysis looks at individuals independent
of their fellow participants, interaction analysis is specifically concerned with con- 
tact-based/interpersonal information. This type of analysis typically involves the 
construction of social networks based on who is talking to whom (Sack 2000; Smith 
and Fiore 2001). Such analysis can help identify key members and relationships by 
utilizing social network analysis metrics such as centrality and link densities. 
Interaction analysis can also provide insight into important roles such as major/ 
minor discussants, reply sources, and reply sinks (Smith and Fiore 2001). 
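
To make the structural features above concrete, the following sketch derives posting-activity counts and two simple network metrics from message headers alone. The header layout (message id, author, parent id), the sample records, and the choice to treat the reply network as undirected are all assumptions made for illustration; they are not drawn from any of the systems cited.

```python
from collections import Counter

# Hypothetical header records: (message_id, author, parent_id); parent_id is
# None for a thread-initial message.
HEADERS = [
    (1, "ann", None),
    (2, "bob", 1),
    (3, "cat", 1),
    (4, "ann", 2),
    (5, "bob", 4),
]


def activity_features(headers):
    """Count posts, initial messages, replies, and responses received per author."""
    author_of = {mid: author for mid, author, _ in headers}
    posts, initial, replies, responses = Counter(), Counter(), Counter(), Counter()
    for _, author, parent in headers:
        posts[author] += 1
        if parent is None:
            initial[author] += 1
        else:
            replies[author] += 1
            responses[author_of[parent]] += 1
    return posts, initial, replies, responses


def interaction_metrics(headers):
    """Degree centrality per author and link density of the undirected reply network."""
    author_of = {mid: author for mid, author, _ in headers}
    authors = sorted(set(author_of.values()))
    links = set()
    for _, author, parent in headers:
        if parent is not None and author_of[parent] != author:
            links.add(frozenset((author, author_of[parent])))
    n = len(authors)
    if n < 2:
        return {a: 0.0 for a in authors}, 0.0
    # Degree centrality: share of the other authors this author is linked to.
    centrality = {a: sum(a in link for link in links) / (n - 1) for a in authors}
    # Density: observed links divided by the number of possible author pairs.
    density = len(links) / (n * (n - 1) / 2)
    return centrality, density
```

In the sample data, "ann" exchanges replies with both other authors and so has the highest degree centrality, which is the kind of signal interaction analysis uses to flag key members.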

Textual features include all attributes derived from the message body. Although 
the informational richness of CMC text was previously questioned (Daft and Lengel 
1986), numerous studies have since demonstrated the richness of CMC content 
(Yates and Orlikowski 1992, 2002; Lee 1994; Panteli 2002). Thus, in addition to
topical information and events (e.g., Allan et al. 1998), textual online discourse con- 
tains several types of information including emotions (Picard 1997; Subasic and 
Huettner 2001), opinions (Hearst 1992), style (Abbasi and Chen 2006; Zheng et al. 
2006), and genres (Santini 2004). These additional forms of information can support 
several forms of analysis including social cues (Spears and Lea 1992, 1994), power 
cues (Panteli 2002), and genre analysis (Yates and Orlikowski 1992, 2002). Social 
cues include textual elements not related to formal content or subject matter (Henri 
1992). Hara et al. (2000) provided examples of social cues that included self-introductions,
expressions of feeling, greetings, signatures, jokes, use of symbolic icons, and com- 
pliments. Text also contains evidence of power cues, where the style of messages 
differs depending upon whether the interaction is between participants at the same 
or different levels within an organization (Panteli 2002). Genre theory identified 
types of writing based on purpose and form (e.g., memos, meetings, reports, etc.) 
and demonstrated how different genres served as a source of organizing structures 
and communicative norms (Yates and Orlikowski 1992; Yates et al. 1999). 



2.2 CMC Systems 

CMC systems can be divided into two categories based on functionality: those
that support the communication process and those that support analysis of
communication content (Sack 2000). While it is certainly possible for a single system to
support both functions (e.g., Erickson and Kellogg 2000), we focus only on the
analysis functionalities provided by these systems because of their relevance to CMC
content analysis. Table 12.1 provides a review of previous CMC systems based on
the type of analysis features included.

A plethora of CMC systems have been developed to support structural features. 
Several tools visualize posting activity patterns, such as Loom (Donath et al. 1999) 
and Authorlines (Viegas and Smith 2004). PeopleGarden and Communication 
Garden both use garden metaphors with flower glyphs to display author and thread 
activity (Xiong and Donath 1999; Zhu and Chen 2001). The number of petals, number 
of thorns, petal colors, and stem lengths are used to represent activity features such

Table 12.1 Previous CMC systems

                                                           Feature types
System name                 References                     Struct.    Text    Feature descriptions
Chat Circles                Donath et al. (1999)           ✓          ✓       Length, headers
Loom                        Donath et al. (1999)           ✓          ✓       Terms, punctuation, headers
PeopleGarden                Xiong and Donath (1999)        ✓          -       Headers
Babble                      Erickson and Kellogg (2000)    ✓          -       Headers
Conversation Map            Sack (2000)                    ✓          ✓       Semantic, headers
Communication Garden        Zhu and Chen (2001)            ✓          ✓       Noun phrases, headers
Coterie                     Donath (2002)                  ✓          -       Headers
Newsgroup Treemaps          Fiore and Smith (2002)         ✓          -       Headers
PostHistory                 Viegas et al. (2004)           ✓          -       Headers
Social Network Fragments    Viegas et al. (2004)           ✓          -       Headers
Authorlines                 Viegas and Smith (2004)        ✓          -       Headers
Newsgroup Crowds            Viegas and Smith (2004)        ✓          -       Headers

as total number of posts and the number of threads an author has been active in. 
Babble (Erickson and Kellogg 2000) and Coterie (Donath 2002) are both geared 
toward showing activity patterns in persistent conversation. In these systems, all 
participants are displayed in a two-dimensional space. More active authors are 
shown in the center, while participants with fewer postings gradually shift toward 
the outside. The visual effect represents a good method for identifying active 
participants versus lurkers and free riders (Donath 2002). Chat Circles (Donath 
et al. 1999) presents recently posted messages as bubbles that fade over time as newer 
content is displayed. In contrast to systems providing activity-based functions, 
systems displaying interaction information have also been developed. Conversation 
Map visualizes social networks based on send/reply patterns (Sack 2000). Netscan 
displays message and author interactions (Smith and Fiore 2001; Smith 2002), 
while Loom shows thread-level interaction structures (Donath et al. 1999). 

Previous CMC systems offer limited support for text features. Loom (Donath 
et al. 1999) shows some content patterns based on message moods. The moods are 
assigned based on the occurrence of certain terms and punctuation in the message 
text. Chat Circles (Donath et al. 1999) displays messages based on body text length. 
Conversation Map (Sack 2000) and Communication Garden (Zhu and Chen 2001) 
provide more in-depth topical analysis. Conversation Map uses computational lin- 
guistics to build semantic networks for discussion topics, while Communication 
Garden performs topic categorization based on noun phrases. Overall, the features 
used in CMC systems are insufficient to effectively capture textual content in online 
discourse (Sack 2000; Whitelaw and Patrick 2004). Text systems are a related 
class of systems that are used for information retrieval. However, information 
retrieval (IR) systems are more concerned with information access than analysis 
(Hearst 1999). Mladenic (1999) presented a review of 29 IR systems, all of which
used bag-of-words to represent textual features. Thus, text-based information
systems (TBIS) for supporting in-depth text analysis of CMC content remain 
nonexistent. 



2.3 Need for CMC Systems Supporting Text Analysis 

Numerous CMC researchers and analysts have stated the need for tools to support 
computer-mediated communication text analysis. Textual features are important yet 
often overlooked in e-mail analysis (Panteli 2002). Cothrel (2000) stated that dis- 
cussion content is an essential dimension of online community success measure- 
ment, yet proper definition and measurement remains elusive. Hara et al. (2000) 
noted that there has been limited CMC content analysis since manual methods are 
time consuming. For instance, features such as use of greetings and signatures, 
which may be important power cues, can easily be captured using stylistic feature 
extractors (Zheng et al. 2006). Manual extraction of textual features can be time 
consuming and typically results in smaller data samples used for experimentation. 
Paccagnella (1997) also suggested that computer programs to support CMC text 
analysis would be helpful, yet do not exist. He noted numerous ways in which text 
analysis systems could benefit CMC content analysis. Based on his recommenda- 
tions, six of the pertinent tasks that CMC text analysis systems could support are 
listed below: 

1. Data linking: connecting relevant data segments to each other, forming categories,
clusters, or networks of information. 

2. Content analysis: counting frequencies, sequences, or locations of words and 
phrases. 

3. Data display: placing selected or reduced data in a condensed, organized format, 
such as a matrix or network, for inspection. 

4. Conclusion-drawing and verification: aiding the analyst to interpret displayed 
data and to test or confirm findings. 

5. Theory-building: developing systematic, conceptually coherent explanations of 
findings; testing hypotheses. 

6. Graphic mapping: creating diagrams that depict findings or theories. 
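
As one illustration of the content analysis task in the list above (counting frequencies, sequences, and locations of words), the sketch below profiles a small set of messages. The regular-expression tokenizer, the use of adjacent word pairs to stand in for "sequences", and the function names are assumptions for this example only.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a message and split it into alphabetic tokens (assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def content_profile(messages):
    """Return word frequencies, adjacent word-pair counts, and word locations."""
    frequencies = Counter()
    sequences = Counter()
    locations = defaultdict(list)  # word -> [(message index, token position), ...]
    for m_idx, text in enumerate(messages):
        tokens = tokenize(text)
        frequencies.update(tokens)
        sequences.update(zip(tokens, tokens[1:]))
        for pos, token in enumerate(tokens):
            locations[token].append((m_idx, pos))
    return frequencies, sequences, locations
```

The location index records where each word occurs, so an analyst could jump from a frequent term back to the messages that contain it.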

Given the need for CMC text analysis and lack of systems that address this 
need, an important and obvious question arises. Why do most CMC systems sup- 
port structural information but not textual content features? There are three major 
differences between structural and textual features that are likely responsible for 
the disparity between the numbers of systems representing these feature types. 
These three factors are feature definitions, extraction, and presentation. Structural 
features are well defined, easy to extract, and easy to visualize. Activity- (Fiore 
et al. 2002) and interaction-based features and metrics have been well defined in 
the sociology literature. Furthermore, posting activity features and interaction
features (e.g., network metrics) can each easily be extracted from message headers.
The extracted features are typically visualized using bar chart variants for activity
frequency (e.g., Xiong and Donath 1999; Zhu and Chen 2001; Viegas and Smith 
2004) and networks for interaction (e.g., Donath et al. 1999; Smith and Fiore 2001). 
In contrast, rich textual features are not well defined, difficult to extract, and harder 
to present to end users. Text categorization requires a complex set of subjective 
features (Donath et al. 1999). For example, over 1,000 features have been used for 
analyzing style, with no consensus (Rudman 1998). Additionally, text feature extrac- 
tion can be challenging due to high levels of noise in online discourse text (Knight 
1999; Nasukawa and Nagano 2001). Finally, text presentation requires the use of 
multiple presentation views (Losiewicz et al. 2000) since standard visualization 
techniques may not apply to text (Keim 2002). Consequently, different techniques 
have been developed to support various facets of text visualization (Wise 1999; 
Miller et al. 1998; Rohrer et al. 1998; Huang et al. 2005) with no ideal solution. 

In light of these challenges, Sack (2000) argues for a new CMC system design 
philosophy that incorporates automatic text analysis techniques. He states “...it is 
necessary to formulate a complementary design philosophy for CMC systems in 
which the point is to help participants and observers spot emerging groups and 
changing patterns of communication...” (p. 86). Design guidelines are needed 
because of the lack of previous tools for CMC textual analysis, complexity of text 
analysis tasks, and the fact that guidelines for appropriate features, feature selection, 
and presentation styles are not well defined. Hence, a design framework for 
text-based information systems must address the aforementioned text mining issues. 
A review of each of these issues is presented in the following section. 



3 CMC Text Mining 

Text mining is concerned with the process of extracting interesting information and 
knowledge from unstructured/free text. In order to develop a set of design guide- 
lines for text-based information systems, we first present a review of the relevant 
text mining elements pertaining to text analysis systems, which include: tasks, 
information types, features, feature selection methods, and visualization. 



3.1 Tasks 

Categorization and analysis are two important text mining tasks (Lewis 1992; 
Tan 1999). 

Categorization refers to the assignment of text to predefined categories based on 
content (Dumais et al. 1998; Chen 2001). This includes classification and clustering 
operations. Analysis is concerned with the trends, patterns, and comparisons derived 
from text (Hearst 1999; Nasukawa and Nagano 2001). Categorization tasks use
textual feature occurrence frequencies, similarities, and variations for classification or
clustering, while analysis tasks use that information for identification and assessment
of trends and patterns. For CMC text analysis, categorization and analysis are both 
important. Categorization enables the classification of messages by different 
information types, while analysis functionalities provide support for analyzing 
patterns and trends across various dimensions including forums, threads, authors, 
messages, and time. The categorization and analysis tasks correspond to the data- 
linking and content analysis functions that Paccagnella (1997) suggested CMC 
systems should support. 



3.2 Information Types 

Systemic Functional Linguistic Theory states that language has three kinds of
meaning: ideational, interpersonal, and textual (Halliday 1994). Ideational means that
language consists of ideas. Textual indicates that language has organization and 
structure. Interpersonal refers to the fact that language is a medium of exchange. 
The interpersonal dimension of language is effectively covered by social/interaction 
networks (Sack 2000). However, text-based information systems should incorporate 
a wide range of ideational and textual information. Examples of ideational informa- 
tion types found in text include topics, events, opinions, and emotions. 

Topical information is the most commonly represented form of text information. 
It is supported by all information retrieval systems (Mladenic 1999). Common 
features used to represent topics include bag-of-words, noun phrases, and named
entities. For example, HelpfulMed (Chen et al. 2003) creates topical categorization maps
based on the noun phrases extracted from the content of medical document corpora. 
In contrast, events are specific topics/incidents with a temporal dimension. While 
“hurricane” is a topic, “Hurricane Katrina” is an event. Event detection/tracking has 
garnered significant attention in recent years. Although event detection and tracking 
are important areas of text analysis, they continue to present challenges, and effective 
features and techniques remain elusive (Allan et al. 1998). 

Additional forms of ideational information include opinions and emotions. 
Opinions include sentiments about a particular topic, such as agonistic, neutral (no 
opinion), or antagonistic (Hearst 1992). Popular applications of opinion-related 
information include mining movie and product review sites (Turney and Littman 
2003). Text is also rich in emotional information (Picard 1997). Emotions encom- 
passed in online communication include various affects such as happiness, horror, 
anger, etc. (Subasic and Huettner 2001). 

Language also contains textual features such as organization and structure. 
Examples of textual information types include style, genres, and vernaculars. Style 
includes numerous stylistic attributes, including vocabulary richness, word choice, 
and punctuation usage (Argamon et al. 2003; Abbasi and Chen 2006). Example 
styles include formal (use of greetings, structured sentences, paragraphs), informal 
(no sentences, no greetings, erratic punctuation usage), etc. Style is based on 
the literary choices an author makes, which can be a reflection of his/her context
(who, what, when, why, where) and personal background (education, gender, etc.).
Stylistic information is utilized in numerous forms of analysis. Authorship analysis
attempts to identify and characterize individuals based on their writing style
(Argamon et al. 2003; Abbasi and Chen 2006), deception detection attempts to
determine if an individual’s writing is deceitful (Zhou et al. 2004), while power cue
identification looks at the stylistic differences in writing between superiors and
subordinates in organizational settings (Panteli 2002). Genres are classes of writing.
Genres found in an organizational communication setting include inquiries,
informational messages, news articles, memos, resumes, reports, interviews, etc. (Yates
and Orlikowski 1992; Santini 2004). Genres have a profound impact on the
structure and organization of text in computer-mediated communication. Table 12.2 shows
examples of each information type and the corresponding analysis applications.

Table 12.2 Text information types

Information type  Examples               Analysis types       References
Ideational        Topics                 Topical analysis     Mladenic (1999), Chen et al. (2003)
                  Events                 Event detection      Allan et al. (1998)
                  Opinions               Sentiment analysis   Hearst (1992), Turney and Littman (2003)
                  Emotions               Affect analysis      Picard (1997), Subasic and Huettner (2001)
Textual           Style                  Authorship analysis  Argamon et al. (2003), Abbasi and Chen (2006)
                                         Deception detection  Zhou et al. (2004)
                                         Power cues           Panteli (2002)
                  Genres                 Genre analysis       Yates and Orlikowski (1992, 2002), Santini (2004)
                  Metaphors/vernaculars  Semantic networks    Sack (2000)



3.3 Features 

Linguistic features can be classified into two broad categories: language resources
and processing resources (Cunningham 2002). Both categories are often used in 
conjunction to complement each other. Language resources are data-only resources 
such as lexicons, thesauruses, word lists (e.g., pronouns), etc. Such self-standing 
features exist independent of the context and provide powerful discriminatory 
potential. However, language resource construction is typically manual, and fea- 
tures may be less generalizable across information types (Pang et al. 2002). 

Processing resources require programs/algorithms for computation. For instance, 
parts of speech, n-grams, statistical features (e.g., vocabulary richness), and bag-of- 
words are all examples of processing resources since they require processing opera- 
tions before being used. The majority of these features are context-dependent, 
meaning that they change according to the text corpus. However, the extraction
procedures/definitions remain constant, making processing resources highly
generalizable across information types. Consequently, processing resource features 
such as bag-of-words, parts of speech, and n-grams are used to represent numerous 
information types including topics, events, sentiments, emotions, style, and genres 
(Pang et al. 2002; Argamon et al. 2003; Santini 2004). Using language and 
processing resources in conjunction can improve text categorization and analysis 
capabilities since processing resources provide breadth while language resources 
offer depth. 
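
The distinction between the two feature categories can be illustrated with a minimal Python sketch. It is not taken from any system described in this chapter: the pronoun list, the tokenizer, and all function names are hypothetical, and the type-token ratio is used here as one simple stand-in for vocabulary richness.

```python
import re
from collections import Counter

# Hypothetical language resource: a small, manually built pronoun word list.
PRONOUNS = {"i", "you", "he", "she", "it", "we", "they",
            "me", "him", "her", "us", "them"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(tokens):
    """Processing resource: word occurrence counts for one message."""
    return Counter(tokens)


def word_ngrams(tokens, n):
    """Processing resource: counts of contiguous word n-grams."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def vocabulary_richness(tokens):
    """Processing resource: type-token ratio (distinct words / total words)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def pronoun_rate(tokens):
    """Language resource lookup: share of tokens found in the pronoun list."""
    return sum(t in PRONOUNS for t in tokens) / len(tokens) if tokens else 0.0


message = "We will meet them at noon. We will not be late."
tokens = tokenize(message)
features = {
    "bag_of_words": bag_of_words(tokens),
    "bigrams": word_ngrams(tokens, 2),
    "vocabulary_richness": vocabulary_richness(tokens),
    "pronoun_rate": pronoun_rate(tokens),
}
```

The extraction functions stay the same for any corpus while their output changes with the text, whereas the word list is fixed data that must be curated by hand.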



3.4 Feature Selection 

Given the complexities of language and text content, a large number of potential 
linguistic features are used for categorization and analysis. However, only a subset 
of these features may be relevant or useful for a particular instance (e.g., an author, 
document, message, etc.). Three types of feature selection techniques have been 
identified in previous research (Guyon and Elisseeff 2003), all of which have also 
been used in text mining. These include ranking, projection, and subset selection 
methods. Ranking techniques are those that rank/sort attributes based on some 
heuristic (Duch et al. 2004; Hearst 1999). Examples of ranking techniques include 
information gain, chi-squared, and Pearson’s correlation coefficient (Forman 2003). 
Projection methods are transformation-based techniques that utilize dimensionality 
reduction (Huber 1985; Huang et al. 2005). These include techniques such as 
principal component analysis (PCA), multidimensional scaling (MDS), and self- 
organizing map (Chen et al. 2003; Huang et al. 2005). Subset selection techniques 
select a subset of the original attributes. These techniques typically use search strat- 
egies, including exhaustive and random search, to consider different feature combi- 
nations that comprise subsets of the original feature set (Dash and Liu 1997). Each 
of the three feature selection techniques has its advantages and disadvantages. 

Ranking methods have been used in several text categorization and analysis stud- 
ies. Efron et al. (2004) used information gain to select bag-of-words for topic cate- 
gorization. Abbasi and Chen (2005) used decision tree models to select key features 
in order to analyze the important stylistic differences between web forum authors. 
Pang et al. (2002) used minimum frequency thresholds to filter out n-grams for 
sentiment classification. Ranking methods offer greater explanatory potential since 
they preserve the original feature set and simply sort the features based on some 
heuristic (Seo and Shneiderman 2002). These techniques also offer simplicity and 
scalability; however, they typically only consider an individual feature’s predictive 
power (Guyon and Elisseeff 2003; Li et al. 2006). Thus, important feature interac- 
tions may be lost. 
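
As an illustration of a ranking method, the following Python sketch computes information gain for binary (present/absent) features and sorts them by score. The toy data, feature names, and function names are assumptions made for this example only; the sketch also makes visible that each feature is scored in isolation.

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_present, labels):
    """Entropy reduction obtained by splitting the messages on one binary feature."""
    with_feature = [y for x, y in zip(feature_present, labels) if x]
    without_feature = [y for x, y in zip(feature_present, labels) if not x]
    remainder = 0.0
    for subset in (with_feature, without_feature):
        if subset:
            remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder


def rank_features(feature_table, labels):
    """Sort feature names by information gain, highest first."""
    scores = {name: information_gain(column, labels)
              for name, column in feature_table.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data: four messages written by two authors, with word presence flags.
labels = ["A", "A", "B", "B"]
feature_table = {
    "regards": [1, 1, 0, 0],  # separates the two authors perfectly
    "meeting": [1, 0, 1, 0],  # carries no information about the author
    "thanks": [1, 1, 1, 0],   # partially informative
}
ranking = rank_features(feature_table, labels)
```

Because the original features are kept and only reordered, the ranked list can be read directly, which is the explanatory advantage noted above.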

Table 12.3 Feature selection methods applied to text

Selection method  Example                       Analysis type  References
Ranking           Information gain              Topical        Efron et al. (2004)
                  Decision tree model           Authorship     Abbasi and Chen (2005)
                  Minimum frequency             Sentiment      Pang et al. (2002)
Projection        Principal component analysis  Authorship     Abbasi and Chen (2006)
                  Multidimensional scaling      Topical        Allan et al. (2001)
                  Self-organizing map           Topical        Chen et al. (2003)
Subset selection  Genetic algorithm             Authorship     Li et al. (2006)

Projection methods have been used to transform text feature spaces into two- or
three-dimensional projections for authorship and topical categorization (Allan et al.
2001; Chen et al. 2003; Abbasi and Chen 2006). Projection methods are highly
robust against noise, which makes them very useful for text analysis since they can
uncover important underlying patterns (Abbasi and Chen 2006). However,
transforming the original features into projections reduces explanatory
potential (Seo and Shneiderman 2002). Projection methods can describe
important high-level patterns and trends but have difficulty explaining details about
specific features.
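
To make the projection idea concrete, the sketch below reduces a small feature matrix to two dimensions with principal component analysis. It is a simplified illustration under stated assumptions: it uses plain Python with power iteration and deflation rather than a numerical library, and the toy matrix and all function names are hypothetical.

```python
import math
import random


def center(rows):
    """Subtract each column mean so every feature is zero-centered."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix (d x d) of zero-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def mat_vec(matrix, v):
    """Multiply a square matrix by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in matrix]


def dominant_eigenpair(matrix, iterations=500, seed=0):
    """Power iteration for the largest eigenvalue of a symmetric matrix."""
    rng = random.Random(seed)  # fixed seed keeps the sketch deterministic
    v = [rng.uniform(-1.0, 1.0) for _ in matrix]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = mat_vec(matrix, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(matrix, v)))
    return eigenvalue, v


def pca_2d(rows):
    """Project rows onto the two leading principal components."""
    centered = center(rows)
    cov = covariance(centered)
    components = []
    for _ in range(2):
        lam, v = dominant_eigenpair(cov)
        components.append(v)
        # Deflation: remove the component just found so that the next
        # pass converges to the following principal component.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(len(v))]
               for i in range(len(v))]
    points = [[sum(a * b for a, b in zip(r, c)) for c in components]
              for r in centered]
    return points, components


# Toy matrix: five messages (rows) described by three feature counts (columns).
rows = [
    [2.0, 0.0, 1.0],
    [3.0, 1.0, 1.0],
    [4.0, 0.0, 2.0],
    [5.0, 1.0, 1.0],
    [6.0, 0.0, 2.0],
]
points, components = pca_2d(rows)
```

Each projected coordinate is a weighted mix of all original features, which is why the overall pattern is easy to plot but the contribution of any single feature is harder to read off.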

Subset selection methods provide a high level of power, often outperforming 
other techniques in terms of predictive abilities (Dash and Liu 1997). For example, 
Li et al. (2006) demonstrated the effectiveness of a genetic algorithm for feature 
subset selection for authorship analysis. Subset selection techniques consider
feature interactions (unlike many ranking methods). However, subset selection
techniques can be computationally inefficient. Such search-based methods are often
considered to be a “brute force” approach since they simply try numerous feature
combinations (Guyon and Elisseeff 2003). This is a serious concern for systems where
feature selection operations are likely to see heavy usage (such as text-based systems).
In addition, the large number of potential features in text analysis can further 
decrease the efficiency of subset selection methods since it results in even larger 
search spaces with greater potential feature combinations. Table 12.3 shows exam- 
ple selection methods that have been applied to text mining and the type of analysis 
performed. Ranking and projection methods have seen greater use due to their sim- 
plicity/efficiency and propensity to handle noise, respectively. 
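
The cost of search-based subset selection can be seen in a small Python sketch of exhaustive search. The scoring function is a hypothetical stand-in for a real evaluation measure such as classification accuracy, and the feature names are invented for the example.

```python
from itertools import combinations


def exhaustive_subset_search(feature_names, score):
    """Score every non-empty feature subset and return the best one."""
    best_subset, best_score, evaluated = None, float("-inf"), 0
    for size in range(1, len(feature_names) + 1):
        for subset in combinations(feature_names, size):
            evaluated += 1
            value = score(subset)
            if value > best_score:
                best_subset, best_score = subset, value
    return best_subset, best_score, evaluated


def toy_score(subset):
    """Hypothetical score: two features help only when used together."""
    chosen = set(subset)
    value = 0.0
    if "word_length" in chosen:
        value += 0.2
    if {"punctuation", "greeting"} <= chosen:
        value += 0.5  # a feature interaction that per-feature ranking would miss
    return value - 0.01 * len(chosen)  # small penalty for each extra feature


names = ["word_length", "punctuation", "greeting", "digits"]
best, best_value, evaluated = exhaustive_subset_search(names, toy_score)
```

With n candidate features there are 2^n - 1 non-empty subsets, so the 15 evaluations needed for four features become infeasible for the feature set sizes typical of text analysis; this is why heuristic searches are used in practice.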



3.5 Visualization 

Text visualization is challenging since text cannot easily be described by numbers 
(Keim 2002). It requires the use of multiple views, representing different data types 
(Losiewicz et al. 2000), with varying dimensionalities. Wise (1999) noted that text 
analysis should “...provide a basis for altered visualization of the information for 
different users and purposes... why should we preconceive that there is only one 
‘correct’ visualization of text information in a document corpus?” (p. 1230). For 
instance, text itself is one-dimensional, textual features are multidimensional 
(Huang et al. 2005), and the relation between features and the text they represent 
(e.g., structural, semantic) is often established using 2D-3D text overlay (e.g., 
Cunningham 2002). Thus, while several types of text visualizations have been
developed, we focus on numeric techniques that visualize text feature statistics and
text techniques that visualize features as they occur in the text since these two view 
types provide important complementary functionality. 

Multidimensional views are often used to visualize text feature statistics. Such 
statistics, including frequency, variance, and similarity, provide important insight 
and summarization yet abstract away important meaning from the underlying, 
nonnumeric content they are intended to represent. These techniques can tell us 
what features are important, but not how or why. Text overlay techniques serve two
important functions. First, they allow users to see exactly how and where the features
occur within their proper context. Second, they allow users to
assess the quality of feature extraction and representation (Losiewicz et al. 2000),
which matters given the high levels of noise in text (Knight 1999; Nasukawa and Nagano 2001).
Thus, it is important to incorporate multidimensional presentation formats that can 
summarize feature statistics as well as text formats that can bridge the gap between 
feature statistics and their actual occurrences in text. 

Several multidimensional techniques have been used for text visualization. These 
techniques, including graphs/plots and reduced dimensionality views, are used to 
display feature occurrence statistics and patterns. Several graphs and plots have 
been used including radar charts, parallel coordinates, and scatter plot matrices. 
Subasic and Huettner (2001) and Abbasi and Chen (2005) used radar charts to view 
affect and stylistic feature occurrences, respectively. Huang et al. (2005) used paral- 
lel coordinates and scatter plot matrices to view topical features extracted from 
biomedical text. Reduced dimensionality visualizations decrease the feature space 
to show essential patterns. These techniques typically are used in conjunction with 
projection-based feature selection techniques such as PCA and MDS to create a 
two- or three-dimensional view. Examples include Writeprints (Abbasi and Chen 
2006), ThemeRiver® (Havre et al. 2002), Text Blobs (Rohrer et al. 1998), and 
Themescapes™ (Wise 1999). 

Text overlay combines text with feature occurrence patterns to provide greater 
insight. Examples include the Stereoscopic Document View (Miller et al. 1998) and 
text annotation (Cunningham 2002). The Stereoscopic Document View in Topic 
Islands™ uses wavelet transformations to show the key topical patterns within a 
document, superimposed onto the text (Miller et al. 1998). Text annotation simply 
highlights the feature occurrences in the text (Cunningham 2002). 



4 A Design Framework for CMC Text Analysis 

Design is a product and a process (Walls et al. 1992; Hevner et al. 2004). The design 
product is the set of requirements and necessary design features that should guide 
the IT artifact construction. The design process is the steps and procedures taken to 
develop the artifact. Information systems development typically follows an iterative 
process of building and evaluating (March and Smith 1995), which is analogous to
the generate/test cycle proposed by Simon (1996). Such an approach is particularly
important in design situations involving a complex or poorly defined set of user
requirements (Markus et al. 2002). We believe that the ambiguities associated with
CMC text analysis component alternatives also warrant the use of such a design
process. Thus, we focus on the design product elements of Walls et al.’s (1992)
model, which are presented in Table 12.4.

Table 12.4 Components of an ISDT design product

Design product
1. Kernel theories      Theories from natural and social sciences governing design requirements
2. Meta-requirements    Describes a class of goals to which theory applies
3. Meta-design          Describes a class of artifacts hypothesized to meet meta-requirements
4. Testable hypotheses  Used to test whether meta-design satisfies meta-requirements

Using Walls et al.’s guidelines, we propose a design framework for CMC text 
analysis. Systemic Functional Linguistic Theory illustrates the need for incorporat- 
ing different information types for text analysis, such as ideational and textual infor- 
mation. However, capturing multiple information types for representational richness 
can be challenging. For instance, topical information has a well-established set of 
features that are typically used (i.e., bag-of-words); however, as previously men- 
tioned, over 1,000 features have been used for style alone (Rudman 1998). Opinion 
has also seen the use of lexical, syntactic, lexicon, and structural features. Thus, 
improving representational richness may result in increased complexity with respect 
to the myriad of potential features, selection methods, and visualization techniques 
that can be utilized. 

When dealing with numerous methods with varying strengths and weaknesses, 
methodological triangulation (Denzin 1970) is useful for overcoming the problems
that may stem from overdependence on any one method (Balakrishnan and
Jacob 1995). In the social sciences, methodological triangulation suggests that 
researchers should use divergent methods for measuring and analyzing constructs 
(Campbell and Fiske 1959). Balakrishnan and Jacob (1995) extended the concept of 
methodological triangulation to be applicable to information systems design guide- 
lines. They designed and developed a Decision Support System (DSS) that incorporated 
multiple complementary search strategies to support product design optimization. 
Their DSS utilized exhaustive search methods which provided high performance 
yet low efficiency and random search methods which provided varying levels of 
performance and improved efficiency. Balakrishnan and Jacob (1995) argued that 
methodological triangulation can improve performance and user confidence, 
resulting in greater system usage. 

In our review of the key CMC text mining characteristics, we described the 
strengths and weaknesses of different types of features and selection and
presentation methods. We also noted how different categories of these characteristics
can complement each other for improved analysis and categorization
capabilities. For instance, using both language and processing resources may be more
beneficial than using a single feature type. Similarly, ranking and
projection methods for feature selection can complement each other by facili- 
tating analysis of overview (projection methods) and specific feature details 
(ranking methods). 



4.1 Meta-requirements 

Using Systemic Functional Linguistic Theory and methodological triangulation as 
our kernel guiding principles, we propose several meta-requirements based on our 
review of the key text mining elements. The primary objective of our meta-requirements 
is to improve the representational richness of text analysis systems over the standard 
bag-of-words features. We wish to couple this richer feature set with the appropriate 
selection and visualization techniques to allow support for CMC text analysis: 

1. Support for various text mining tasks.

2. Support in-depth content analysis across various information types for improved 
representational richness. 

3. Support large sets of complementary and contrasting textual features. 

4. Feature selection methods to support selection of attributes that best capture the 
underlying content based on user needs. 

5. Visualization techniques to support presentation of overview and details, make 
comparisons, and assess similarity. 



4.2 Meta-design 

Based on our meta-requirements and review of the important categories for each 
element, we propose the following meta-design: 

1. Include categorization and analysis functionality (from Requirement 1). 

2. Incorporate information types to capture ideational and textual language mean- 
ing (from Requirement 2). 

3. Language and processing resources-based textual features (from Requirement 3). 

4. Ranking and projection-based feature selection methods (from Requirement 4). 

5. Basic, multidimensional, and text overlay visualization techniques (from 
Requirement 5). 

Based on our meta-design, we present a framework for the design of text-based 
information systems to support computer-mediated communication content analy- 
sis (shown in Fig. 12.2). While the framework is applicable to all forms of text, we 
focus specifically on computer-mediated communication due to its high level of 
interaction and informational richness.

Fig. 12.2 A design framework for CMC text analysis



5 System Design: The CyberGate System 

Using our design framework as a guideline, we developed a text analysis system for 
CMC analysis called CyberGate. The system was developed using a cyclical design 
process involving several iterations of adding and testing system components. The 
testing phase included experiments for performance evaluation and feedback from 
CMC researchers and web analysts. In this section, we describe the system in its 
completed state. While the system supports several tasks, information types, features, 
feature selection methods, and visualization techniques, the two major components 
of CyberGate are Writeprints and Ink Blots. We first present an overview of the 
CyberGate system based on our design framework and then provide in-depth details 
of the Writeprint and Ink Blot techniques. Figure 12.3 shows an overview of the 
system design. 



5.1 Information Types and Features 

CyberGate supports several information types, including topics, sentiments, affects, 
style, and genres. Event information was not included since the current state of the 
art for event detection is insufficient for effectively capturing and representing 
events (Allan et al. 1998). In order to enable the capturing of such a breadth of infor- 
mation, several language and processing resources were included. These include 
language resources such as sentiment and affect lexicons, word lists, and the 
WordNet thesaurus (Fellbaum 1998). Embedded processing resources include 
n-grams, statistical features (Abbasi and Chen 2005; Zheng et al. 2006), parts of 
speech, noun phrases, and named entities (McDonald et al. 2004).

Fig. 12.3 CyberGate system design

Fig. 12.4 CyberGate feature selection examples: (a) shows the complete set of all features,
(b) shows the top two dimensions of the PCA projections (Ex and Ey), and (c) shows the decision
tree model rankings



5.2 Feature Selection 



CyberGate uses both ranking and projection-based feature reduction methods. For 
feature ranking, it uses information gain (IG) and decision tree models (DTM). 
Both of these methods have been shown to be effective for textual feature selection 
(Forman 2003; Efron et al. 2004; Abbasi and Chen 2005). For projection, it uses 
PCA and MDS for lower dimension feature transformations. PCA and MDS have
both been previously used for textual feature reduction (Abbasi and Chen 2006;
Huang et al. 2005). Figure 12.4 shows examples of the feature selection techniques 
used in CyberGate. The table on the left (a) shows the complete set of features, 
while (b) shows the two-dimensional PCA projections (Ex and Ey) and (c) shows 
the decision tree model rankings where each feature is assigned a weight. Higher 
weights indicate greater feature rank. 



5.3 Visualization 

CyberGate includes multidimensional and text overlay-based visual representa- 
tions. Multidimensional visualizations (shown in Fig. 12.5) include Writeprints 
(Fig. 12.5a) to show usage variation and parallel coordinates (Fig. 12.5b) to show 
feature similarities across messages or text windows (Abbasi and Chen 2006). Each 
circle in Writeprints denotes a single message or text window projected using prin- 
cipal component analysis (PCA). The blue polygonal lines in parallel coordinates 
also represent individual messages or text windows. The selected Writeprint point 
corresponds to the selected parallel coordinate’s polygonal line. The intersection 
between a polygonal line and a vertical axis in parallel coordinates represents the 
occurrence frequency of that feature in that particular message. For example, the 
selected message in Fig. 12.5b has a very high occurrence of feature #7 (it occurs 21
times).

CyberGate also utilizes MDS plots (Fig. 12.5d) to show overall feature similari- 
ties, while radar charts (Fig. 12.5c) are used to compare feature occurrence statistics 
across authors. The radar chart shown compares the selected author against another 
author and the mean normalized usage frequencies for a particular set of features 
(which are numbered along the perimeter). The MDS plot in Fig. 12.5d shows
features projected based on occurrence similarity for the bag-of-words features. We
can see one large cluster and two smaller ones in addition to 3-4 features that are on 
their own. These features (e.g., “services”) do not frequently co-occur with any of 
the three clusters. 

The CyberGate system also features two text overlay techniques (shown
in Fig. 12.6). Text annotation simply highlights key features in the text (Cunningham 
2002). Figure 12.6a shows the text annotation view in which the bag-of-words fea- 
tures are highlighted in blue, while the selected feature (“CounselEnron”) is high- 
lighted in red. Ink Blots (Fig. 12.6b) superimposes colored circles (blots) onto text 
for key features as identified by the particular underlying feature ranking method 
incorporated. The size of the blot indicates the feature rank/weight (based on the 
feature ranking technique). Features that are more unique to a particular author have 
higher weights than features that are equally common (less interesting) across 
authors. The color indicates the author’s usage of the particular feature (red = high, 
blue = low, yellow = medium). The selected feature (again, “CounselEnron”) is high- 
lighted with a black circle. This particular feature is represented with large red blots 
indicating that the feature has a high weight (rarely used by others, unique to this 
author) and is frequently used by this author. 
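
The mapping from feature weight and usage to blot size and color can be sketched as follows. This is only an illustration of the idea described above, not the CyberGate implementation: the usage thresholds, the pixel radii, the linear scaling, and all names and example values are assumptions.

```python
import re


def blot_color(usage, low=0.33, high=0.66):
    """Map normalized author usage in [0, 1] to a color (thresholds are assumed)."""
    if usage >= high:
        return "red"      # high usage by this author
    if usage <= low:
        return "blue"     # low usage by this author
    return "yellow"       # medium usage


def blot_radius(weight, max_weight, min_radius=4.0, max_radius=20.0):
    """Scale a feature weight linearly to a radius in pixels (assumed scaling)."""
    if max_weight <= 0:
        return min_radius
    return min_radius + (max_radius - min_radius) * (weight / max_weight)


def ink_blots(text, feature_weights, author_usage):
    """Return one blot per occurrence of a weighted feature in the text."""
    max_weight = max(feature_weights.values())
    blots = []
    for match in re.finditer(r"\w+", text):
        word = match.group().lower()
        if word in feature_weights:
            blots.append({
                "start": match.start(),  # character offset where the blot is drawn
                "feature": word,
                "radius": blot_radius(feature_weights[word], max_weight),
                "color": blot_color(author_usage.get(word, 0.0)),
            })
    return blots


# Hypothetical weights (from some ranking method) and normalized usage values.
weights = {"counsel": 0.8, "meeting": 0.2}
usage = {"counsel": 0.9, "meeting": 0.1}
blots = ink_blots("Counsel requested a meeting with counsel.", weights, usage)
```

In this toy input, the highly weighted and frequently used word receives large red blots at both of its occurrences, mirroring the reading of large red blots given above.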






a Writeprints: Two-dimensional PCA projections based on feature occurrences. Each
circle denotes a single message. Selected message is highlighted in pink. Writeprints
show feature usage/occurrence variation patterns. Greater variation results in
more sporadic patterns.

b Parallel Coordinates: Parallel vertical lines represent features. Bolded numbers are
feature numbers (0-15). Smaller numbers above and below feature lines denote feature
range. Blue polygonal lines represent messages. Selected message is highlighted in
red. Selected feature is highlighted in pink (#2).

c Radar Charts: Chart shows normalized feature usage frequencies. Blue line represents
author’s average usage, red line indicates mean usage across all authors, and green
line is another author (being compared against). The numbers represent feature
numbers. Selected feature is highlighted (#6).

d MDS Plots: MDS algorithm used to project features into two-dimensional space based
on occurrence similarity. Each circle denotes a feature. Closer features have higher
co-occurrence. Labels represent feature descriptions. Selected feature is highlighted
in pink (the term “services”).

Fig. 12.5 Multidimensional views in CyberGate

a Text Annotation View: Feature occurrences are highlighted in blue. The selected
bag-of-words feature is highlighted in red (“CounselEnron”).

b Ink Blot View: Colored circles (blots) superimposed onto feature occurrence
locations in text. Blot size and color indicate feature importance and usage.
Selected feature’s blots are highlighted with black circles.

Fig. 12.6 Text views in CyberGate



5.4 Writeprints and Ink Blots 

CyberGate includes the Writeprint and Ink Blot techniques, which are the core com- 
ponents driving the system’s analysis and categorization functions. These tech- 
niques epitomize the essence of the proposed design framework: representational 
richness and methodological triangulation. With respect to representational 
richness, Writeprints and Ink Blots can incorporate a wide range of features repre- 
senting various information types. Both techniques also utilize feature selection and 
visualization. Writeprints uses principal component analysis (PCA) with a sliding 
window algorithm to create two-dimensional plots that accentuate feature usage 
variation. Ink Blots uses decision tree models (DTM) to select features which are 
superimposed onto text to show usage frequencies as they occur within their textual 
structure. The techniques support methodological triangulation in the sense that 
Writeprints uses a projection method for feature selection along with a reduced 
dimensionality visual representation of multidimensional features. Ink Blots uses a 
ranking method for feature selection along with a text overlay visual presentation 
format. Writeprints is better suited to present a broad overview across a large num- 
ber of features, while Ink Blots is more geared to show detailed examples of feature 
occurrences. Both techniques can be used for text categorization and analysis. 
Specific details of Writeprints and Ink Blots are presented below. 



